Load the necessary packages for this assignment:
library(gapminder)
library(tidyverse)
library(knitr)
library(DT)
dplyr
Use filter() to subset the gapminder data to three countries of your choice in the 1970s.
gapminder %>%
  filter(year >= 1970 &
           year < 1980,
         country == "Canada" |
           country == "China" |
           country == "Japan") %>%
  kable()
country | continent | year | lifeExp | pop | gdpPercap |
---|---|---|---|---|---|
Canada | Americas | 1972 | 72.88000 | 22284500 | 18970.5709 |
Canada | Americas | 1977 | 74.21000 | 23796400 | 22090.8831 |
China | Asia | 1972 | 63.11888 | 862030000 | 676.9001 |
China | Asia | 1977 | 63.96736 | 943455000 | 741.2375 |
Japan | Asia | 1972 | 73.42000 | 107188273 | 14778.7864 |
Japan | Asia | 1977 | 75.38000 | 113872473 | 16610.3770 |
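The same subset can be written more compactly. The sketch below assumes the gapminder and dplyr packages are installed; it uses `%in%` for the country match and `between()` (inclusive on both ends) for the decade. The name `seventies` is made up for illustration.

```r
library(gapminder)
library(dplyr)

# Hypothetical name, used only for this illustration
seventies <- gapminder %>%
  filter(between(year, 1970, 1979),                   # 1970 <= year <= 1979
         country %in% c("Canada", "China", "Japan"))  # match any of the three

nrow(seventies)
```

`%in%` scales better than a chain of `|` comparisons when the list of countries grows.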
Use the pipe operator %>% to select country and gdpPercap from your filtered dataset in 1.1.
gapminder %>%
  filter(year >= 1970 &
           year < 1980,
         country == "Canada" |
           country == "China" |
           country == "Japan") %>%
  select(country, gdpPercap) %>%
  kable()
country | gdpPercap |
---|---|
Canada | 18970.5709 |
Canada | 22090.8831 |
China | 676.9001 |
China | 741.2375 |
Japan | 14778.7864 |
Japan | 16610.3770 |
Filter gapminder to all entries that have experienced a drop in life expectancy. Be sure to include a new variable that’s the increase in life expectancy in your tibble. Hint: you might find the lag() or diff() functions useful.
gapminder %>%
  arrange(year) %>%
  group_by(country) %>%
  mutate(lifeExpInc = lifeExp - lag(lifeExp)) %>%
  drop_na() %>%
  filter(lifeExpInc < 0) %>%
  datatable()
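To see why the grouping matters, here is a minimal sketch on made-up toy data (the tibble `toy` and its values are invented for illustration). Within each group, `lag()` returns `NA` for the first row, so one country's first year is never compared with another country's last year.

```r
library(dplyr)

# Toy data (made up for illustration), already ordered by year within country
toy <- tibble(
  country = c("A", "A", "A", "B", "B", "B"),
  year    = c(1952, 1957, 1962, 1952, 1957, 1962),
  lifeExp = c(50, 52, 51, 60, 59, 61)
)

drops <- toy %>%
  group_by(country) %>%
  mutate(lifeExpInc = lifeExp - lag(lifeExp)) %>%  # NA for each country's first year
  ungroup() %>%
  filter(lifeExpInc < 0)                           # filter() also discards NA rows

drops
```

Because `filter()` keeps only rows where the condition is `TRUE`, the `NA` rows are dropped even without a separate `drop_na()` step.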
Filter gapminder so that it shows the max GDP per capita experienced by each country. Hint: you might find the max() function useful here.
gapminder %>%
  group_by(country) %>%
  summarize(maxGDP = max(gdpPercap)) %>%
  datatable()
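`summarize()` returns only the maximum itself. If the year in which each maximum occurred is also of interest, one possible alternative is `slice_max()`, sketched below. This assumes dplyr 1.0.0 or later; the name `peak_gdp` is made up for illustration.

```r
library(gapminder)
library(dplyr)

# Hypothetical name, used only for this illustration
peak_gdp <- gapminder %>%
  group_by(country) %>%
  slice_max(gdpPercap, n = 1, with_ties = FALSE) %>%  # keep the single top row per country
  ungroup() %>%
  select(country, year, gdpPercap)

head(peak_gdp)
```

This keeps whole rows rather than collapsing each group to one summary value, which is closer to "filtering" in the literal sense.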
Produce a scatterplot of Canada’s life expectancy vs. GDP per capita using ggplot2, without defining a new variable. That is, after filtering the gapminder data set, pipe it directly into the ggplot() function. Ensure GDP per capita is on a log scale.
gapminder %>%
  filter(country == "Canada") %>%
  select(lifeExp, gdpPercap) %>%
  ggplot(aes(gdpPercap, lifeExp)) +
  geom_point() +
  scale_x_log10() +
  ggtitle("Canada's Life Expectancy vs GDP per Capita") +
  ylab("Life Expectancy (years)") +
  xlab("GDP per Capita")
dplyr
Pick one categorical variable and one quantitative variable to explore. Answer the following questions in whichever way you think is appropriate, using dplyr:
country
gapminder %>%
select(country) %>%
unique() %>%
nrow()
## [1] 142
lifeExp
For lifeExp, a range of 0 to the maximum life expectancy in the dataset (rounded to a whole number) would be an appropriate range.
gapminder %>%
  select(lifeExp) %>%
  max() %>%
  round()
## [1] 83
# Find the number of countries in this dataset:
gapminder %>%
select(country) %>%
unique() %>%
nrow()
## [1] 142
summary(gapminder) %>%
kable()
country | continent | year | lifeExp | pop | gdpPercap | |
---|---|---|---|---|---|---|
Afghanistan: 12 | Africa :624 | Min. :1952 | Min. :23.60 | Min. :6.001e+04 | Min. : 241.2 | |
Albania : 12 | Americas:300 | 1st Qu.:1966 | 1st Qu.:48.20 | 1st Qu.:2.794e+06 | 1st Qu.: 1202.1 | |
Algeria : 12 | Asia :396 | Median :1980 | Median :60.71 | Median :7.024e+06 | Median : 3531.8 | |
Angola : 12 | Europe :360 | Mean :1980 | Mean :59.47 | Mean :2.960e+07 | Mean : 7215.3 | |
Argentina : 12 | Oceania : 24 | 3rd Qu.:1993 | 3rd Qu.:70.85 | 3rd Qu.:1.959e+07 | 3rd Qu.: 9325.5 | |
Australia : 12 | NA | Max. :2007 | Max. :82.60 | Max. :1.319e+09 | Max. :113523.1 | |
(Other) :1632 | NA | NA | NA | NA | NA |
country
lifeExp
range <- gapminder %>%
select(lifeExp) %>%
range()
range[2] - range[1]
## [1] 59.004
country
lifeExp
IQR():
select(gapminder, lifeExp) %>%
summary() %>%
kable()
lifeExp | |
---|---|
Min. :23.60 | |
1st Qu.:48.20 | |
Median :60.71 | |
Mean :59.47 | |
3rd Qu.:70.85 | |
Max. :82.60 |
(thirdq <- summary(gapminder$lifeExp)["3rd Qu."])
## 3rd Qu.
## 70.8455
(firstq <- summary(gapminder$lifeExp)["1st Qu."])
## 1st Qu.
## 48.198
unname(thirdq - firstq)
## [1] 22.6475
IQR(gapminder$lifeExp)
## [1] 22.6475
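The quartiles can also be pulled directly with `quantile()` rather than by indexing into `summary()`. The following is a small sketch assuming the gapminder package is installed; the names `q` and `iqr_by_hand` are made up for illustration.

```r
library(gapminder)

# First and third quartiles of life expectancy (R's default quantile type)
q <- quantile(gapminder$lifeExp, probs = c(0.25, 0.75))

# Interquartile range as the difference between the two quartiles
iqr_by_hand <- unname(q["75%"] - q["25%"])

iqr_by_hand
```

This should agree with `IQR(gapminder$lifeExp)`, since `IQR()` is computed from the same two quantiles.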
The variance of lifeExp can be computed with var().
gapminder %>%
  select(lifeExp) %>%
  var()
## lifeExp
## lifeExp 166.8517
ggplot(gapminder, aes(lifeExp)) +
geom_histogram(bins=100) +
ggtitle("Distribution of Life Expectancy") +
xlab("Life Expectancy") +
ylab("Count")
The number of entries for each country can be determined using table(); the dplyr equivalent, count(), is used here.
gapminder %>%
  count(country) %>%
  datatable()
Make two plots that have some value to them. That is, plots that someone might actually consider making for an analysis. Just don’t make the same plots we made in class – feel free to use a data set from the datasets R package if you wish. You don’t have to use all the data in every plot! It’s fine to filter down to one country or a small handful of countries.
gapminder %>%
  mutate(million = pop / 1e6) %>%
  ggplot(aes(x = million, y = lifeExp)) +
  geom_point(aes(shape = continent, colour = year)) +
  ggtitle("Life Expectancy vs Population") +
  xlab("Population (millions)") +
  ylab("Life Expectancy (years)")
gapminder %>%
  filter(country == "Canada" | country == "United States" | country == "Mexico") %>%
  ggplot(aes(x = year, y = lifeExp)) +
  geom_line(aes(colour = country)) +
  ggtitle("Life Expectancy in North America") +
  xlab("Year") +
  ylab("Life Expectancy (years)")
Evaluate this code and describe the result. Presumably the analyst’s intent was to get the data for Rwanda and Afghanistan. Did they succeed? Why or why not? If not, what is the correct way to do this?
filter(gapminder, country == c("Rwanda", "Afghanistan"))
# The Analyst's Way:
filter(gapminder, country == c("Rwanda", "Afghanistan")) %>%
nrow()
## [1] 12
# The Correct Way:
filter(gapminder, country == "Rwanda" | country == "Afghanistan") %>%
nrow()
## [1] 24
nrow(gapminder)
## [1] 1704
Below, the first six rows of gapminder are taken with head() and filtered using the analyst’s way. Only 3 of the 6 rows are kept, matching the result above, where 50% of the expected rows are missing.
head(gapminder) %>%
  filter(country == c("Rwanda", "Afghanistan")) %>%
  kable()
country | continent | year | lifeExp | pop | gdpPercap |
---|---|---|---|---|---|
Afghanistan | Asia | 1957 | 30.332 | 9240934 | 820.8530 |
Afghanistan | Asia | 1967 | 34.020 | 11537966 | 836.1971 |
Afghanistan | Asia | 1977 | 38.438 | 14880372 | 786.1134 |
table(
c(
"Afghanistan" == "Rwanda",
"Afghanistan" == "Afghanistan",
"Afghanistan" == "Rwanda",
"Afghanistan" == "Afghanistan",
"Afghanistan" == "Rwanda",
"Afghanistan" == "Afghanistan"
)
) %>%
kable(col.names= c("Boolean", "Frequency"))
Boolean | Frequency |
---|---|
FALSE | 3 |
TRUE | 3 |
table(
c(
"Afghanistan" == c("Rwanda", "Afghanistan"),
"Afghanistan" == c("Rwanda", "Afghanistan"),
"Afghanistan" == c("Rwanda", "Afghanistan"),
"Afghanistan" == c("Rwanda", "Afghanistan"),
"Afghanistan" == c("Rwanda", "Afghanistan"),
"Afghanistan" == c("Rwanda", "Afghanistan")
)
) %>%
kable(col.names= c("Boolean", "Frequency"))
Boolean | Frequency |
---|---|
FALSE | 6 |
TRUE | 6 |
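The tables above come down to vector recycling: `==` compares element by element and repeats the shorter vector to match the longer one. Here is a minimal base R sketch on a made-up toy vector (the names `country`, `wanted`, `recycled`, and `matched` are for illustration only).

```r
# Toy vector standing in for the country column (made up for illustration)
country <- c("Rwanda", "Rwanda", "Afghanistan", "Afghanistan")
wanted  <- c("Rwanda", "Afghanistan")

# `==` recycles `wanted` to c("Rwanda", "Afghanistan", "Rwanda", "Afghanistan"),
# so each row is compared with only one of the two names
recycled <- country == wanted

# `%in%` checks every row against the whole set of names
matched <- country %in% wanted

recycled
matched
```

With `%in%`, the analyst's call could be written as `filter(gapminder, country %in% c("Rwanda", "Afghanistan"))`, which is equivalent to the `|` version shown earlier.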
Present numerical tables in a more attractive form using knitr::kable() for small tibbles (say, up to 10 rows), and DT::datatable() for larger tibbles.
kable(head(gapminder))
country | continent | year | lifeExp | pop | gdpPercap |
---|---|---|---|---|---|
Afghanistan | Asia | 1952 | 28.801 | 8425333 | 779.4453 |
Afghanistan | Asia | 1957 | 30.332 | 9240934 | 820.8530 |
Afghanistan | Asia | 1962 | 31.997 | 10267083 | 853.1007 |
Afghanistan | Asia | 1967 | 34.020 | 11537966 | 836.1971 |
Afghanistan | Asia | 1972 | 36.088 | 13079460 | 739.9811 |
Afghanistan | Asia | 1977 | 38.438 | 14880372 | 786.1134 |
datatable(gapminder)