Set Up

Load the necessary packages for this assignment:

library(gapminder)
library(tidyverse)
library(knitr)
library(DT)

Exercise 1: Basic dplyr

Exercise 1.1

Use filter() to subset the gapminder data to three countries of your choice in the 1970s.

gapminder %>%
  filter(year >= 1970 & 
           year < 1980, 
         country == "Canada" | 
           country == "China" | 
           country == "Japan") %>%
  kable()
country continent year lifeExp pop gdpPercap
Canada Americas 1972 72.88000 22284500 18970.5709
Canada Americas 1977 74.21000 23796400 22090.8831
China Asia 1972 63.11888 862030000 676.9001
China Asia 1977 63.96736 943455000 741.2375
Japan Asia 1972 73.42000 107188273 14778.7864
Japan Asia 1977 75.38000 113872473 16610.3770
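
The same subset can be written more compactly. This is a minimal sketch, assuming the gapminder and dplyr packages are installed; the name seventies is introduced here only for illustration.

```r
library(gapminder)
library(dplyr)

# between() is inclusive at both ends, so 1970 to 1979 covers the 1970s.
# %in% tests membership, which avoids chaining == comparisons with |.
seventies <- gapminder %>%
  filter(between(year, 1970, 1979),
         country %in% c("Canada", "China", "Japan"))

nrow(seventies)
```

The row count should match the six rows in the table above, since gapminder records only 1972 and 1977 within that decade.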

Exercise 1.2

Use the pipe operator %>% to select country and gdpPercap from your filtered dataset in 1.1.

gapminder %>%
  filter(year >= 1970 &
           year < 1980,
         country == "Canada" |
           country == "China" |
           country == "Japan") %>%
  select(country, gdpPercap) %>%
  kable()
country gdpPercap
Canada 18970.5709
Canada 22090.8831
China 676.9001
China 741.2375
Japan 14778.7864
Japan 16610.3770

Exercise 1.3

Filter gapminder to all entries that have experienced a drop in life expectancy. Be sure to include a new variable that’s the increase in life expectancy in your tibble. Hint: you might find the lag() or diff() functions useful.

gapminder %>%
  arrange(year) %>%
  group_by(country) %>%
  mutate(lifeExpInc = lifeExp - lag(lifeExp)) %>%
  drop_na() %>%
  filter(lifeExpInc < 0) %>%
  datatable()
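
To see what lag() does inside each group, here is a small sketch on made-up data. The tibble toy and its values are hypothetical, chosen so the result is easy to check by hand.

```r
library(dplyr)

# Hypothetical toy data: two countries, three survey years each.
toy <- tibble(
  country = rep(c("A", "B"), each = 3),
  year    = rep(c(1952, 1957, 1962), times = 2),
  lifeExp = c(50, 52, 51, 60, 59, 61)
)

drops <- toy %>%
  group_by(country) %>%
  arrange(year, .by_group = TRUE) %>%
  # lag() shifts values down by one within each group; the first year of
  # each country has no predecessor, so its difference is NA.
  mutate(lifeExpInc = lifeExp - lag(lifeExp)) %>%
  ungroup() %>%
  # filter() discards rows where the condition is NA, so the first-year
  # rows drop out here without a separate drop_na() step.
  filter(lifeExpInc < 0)

drops
```

Only the rows where life expectancy fell relative to the previous survey remain: country A in 1962 and country B in 1957.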

Exercise 1.4

Filter gapminder so that it shows the max GDP per capita experienced by each country. Hint: you might find the max() function useful here.

gapminder %>%
    group_by(country) %>% 
    summarize(maxGDP = max(gdpPercap)) %>%
    datatable()
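
summarize() returns only the maximum itself. If the year in which the maximum occurred is also of interest, one option is slice_max(), which keeps whole rows. This is a sketch assuming dplyr 1.0.0 or later; peak_gdp is a name introduced here for illustration.

```r
library(gapminder)
library(dplyr)

# slice_max() keeps the entire row with the largest gdpPercap per country,
# so the year of the peak is retained. with_ties = FALSE guarantees
# exactly one row per country even if two years tie.
peak_gdp <- gapminder %>%
  group_by(country) %>%
  slice_max(gdpPercap, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(country, year, gdpPercap)

head(peak_gdp)
```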

Exercise 1.5

Produce a scatterplot of Canada’s life expectancy vs. GDP per capita using ggplot2, without defining a new variable. That is, after filtering the gapminder data set, pipe it directly into the ggplot() function. Ensure GDP per capita is on a log scale.

gapminder %>%
    filter(country == "Canada") %>%
    select(lifeExp, gdpPercap) %>%
    ggplot(aes(gdpPercap, lifeExp)) +
    geom_point() + 
    scale_x_log10() +
    ggtitle("Canada's Life Expectancy vs GDP per Capita") + 
    ylab("Life Expectancy (years)") +
    xlab("GDP per Capita")

Exercise 2: Explore individual variables with dplyr

Pick one categorical variable and one quantitative variable to explore. Answer the following questions in whichever way you think is appropriate, using dplyr:

What are possible values (or range, whichever is appropriate) of each variable?

  • Categorical variable: country
  • Possible values for a categorical variable are the categories themselves. It has no range in the numeric sense; the closest analogue is the number of categories.
  • There are 142 countries in this dataset.
gapminder %>%
    select(country) %>%
    unique() %>% 
    nrow()
## [1] 142
  • Quantitative variable: lifeExp
  • Possible values are positive numbers (realistically no more than about 100). For lifeExp, an appropriate range runs from 0 to the maximum life expectancy in the dataset, rounded to a whole number.
  • The highest life expectancy in this dataset is about 83 years after rounding.
gapminder %>%
    select(lifeExp) %>%
    max() %>%
    round()
## [1] 83
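
The possible values of country can also be inspected more directly. This sketch assumes the gapminder package, in which country is stored as a factor; n_countries is a name introduced here.

```r
library(gapminder)
library(dplyr)

# n_distinct() counts unique values without building an intermediate table.
n_countries <- n_distinct(gapminder$country)

# Because country is a factor, levels() lists the categories themselves.
head(levels(gapminder$country))

n_countries
```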

What values are typical? What’s the spread? What’s the distribution? Etc., tailored to the variable at hand. Feel free to use summary stats, tables, figures.

Typical Values
  • Life expectancy could in principle fall anywhere from 0 to about 100 years. In this dataset it runs from 23.60 to 82.60 years, and typical values sit near the median of 60.71 years (mean 59.47 years).
  • No country is more typical than another: each of the 142 countries appears equally often (12 rows each).
# Find the number of countries in this dataset:
gapminder %>%
  select(country) %>%
  unique() %>%
  nrow()
## [1] 142
summary(gapminder) %>%
    kable()
country continent year lifeExp pop gdpPercap
Afghanistan: 12 Africa :624 Min. :1952 Min. :23.60 Min. :6.001e+04 Min. : 241.2
Albania : 12 Americas:300 1st Qu.:1966 1st Qu.:48.20 1st Qu.:2.794e+06 1st Qu.: 1202.1
Algeria : 12 Asia :396 Median :1980 Median :60.71 Median :7.024e+06 Median : 3531.8
Angola : 12 Europe :360 Mean :1980 Mean :59.47 Mean :2.960e+07 Mean : 7215.3
Argentina : 12 Oceania : 24 3rd Qu.:1993 3rd Qu.:70.85 3rd Qu.:1.959e+07 3rd Qu.: 9325.5
Australia : 12 NA Max. :2007 Max. :82.60 Max. :1.319e+09 Max. :113523.1
(Other) :1632 NA NA NA NA NA
Spread: Range
  • Range does not apply to categorical variables (e.g. country)
  • Range only applies to quantitative variables (e.g. lifeExp)
  • The range is defined as the difference between the highest and lowest values.
  • The lowest life expectancy is 23.599 years, and the highest life expectancy is 82.603 years.
  • The range is 59.004 years.
# Use a name that does not shadow base R's range() function
lifeExp_range <- gapminder %>%
    select(lifeExp) %>%
    range()

lifeExp_range[2] - lifeExp_range[1]
## [1] 59.004
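
The same calculation can stay inside a dplyr pipeline. This is a sketch assuming the gapminder and dplyr packages; the column names lowest, highest and spread are made up for this example.

```r
library(gapminder)
library(dplyr)

life_spread <- gapminder %>%
  summarise(
    lowest  = min(lifeExp),
    highest = max(lifeExp),
    # Columns created earlier in summarise() can be reused;
    # this is the same quantity as diff(range(lifeExp)).
    spread  = highest - lowest
  )

life_spread
```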
Spread: Interquartile Range
  • Interquartile range does not apply to categorical variables (e.g. country)
  • Interquartile range only applies to quantitative variables (e.g. lifeExp)
  • The first quartile is 48.20 years, the second quartile (the median) is 60.71 years, and the third quartile is 70.85 years.
  • The interquartile range is the third minus the first quartile (which can also be calculated using the function IQR()):
select(gapminder, lifeExp) %>%
    summary() %>%
    kable()
lifeExp
Min. :23.60
1st Qu.:48.20
Median :60.71
Mean :59.47
3rd Qu.:70.85
Max. :82.60
(thirdq <- summary(gapminder$lifeExp)["3rd Qu."])
## 3rd Qu. 
## 70.8455
(firstq <- summary(gapminder$lifeExp)["1st Qu."])
## 1st Qu. 
##  48.198
unname(thirdq - firstq)
## [1] 22.6475
IQR(gapminder$lifeExp)
## [1] 22.6475
  • The interquartile range is 22.6475 years.
Spread: Variance
  • The variance can be calculated using the function var().
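  • The variance in life expectancy is 166.8517 years squared (variance is measured in squared units); its square root, the standard deviation, is roughly 12.92 years.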
gapminder %>%
  select(lifeExp) %>%
  var()
##          lifeExp
## lifeExp 166.8517
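
As a check on what var() computes, the sample variance is the sum of squared deviations from the mean divided by n - 1, and the standard deviation is its square root. A small sketch on made-up values (the vector x is hypothetical):

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)  # hypothetical values
n <- length(x)

# Sample variance: squared deviations from the mean, divided by n - 1.
manual_var <- sum((x - mean(x))^2) / (n - 1)

# The standard deviation is back in the original units of x.
manual_sd <- sqrt(manual_var)

all.equal(manual_var, var(x))
all.equal(manual_sd, sd(x))
```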
Distribution
  • The distribution of life expectancy can be visualized using a histogram.
ggplot(gapminder, aes(lifeExp)) +
  geom_histogram(bins=100) + 
  ggtitle("Distribution of Life Expectancy") +
  xlab("Life Expectancy") +
  ylab("Count")

  • The distribution of country can be tabulated with count() (or with base R's table()).
  • In this case, there are 12 data points (i.e. rows) per country in this dataset.
gapminder %>% 
  count(country) %>%
  datatable()
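
For comparison, a base R sketch of the same tabulation, assuming only the gapminder package; country_counts is a name introduced here.

```r
library(gapminder)

# table() returns a named vector of counts, one entry per factor level.
country_counts <- table(gapminder$country)

# Check that every country has the same number of rows.
all(country_counts == 12)
```

count() returns a tibble that fits into a pipeline, while table() returns a table object, so which one to use is mostly a matter of what comes next.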

Exercise 3: Explore various plot types

Make two plots that have some value to them. That is, plots that someone might actually consider making for an analysis. Just don’t make the same plots we made in class – feel free to use a data set from the datasets R package if you wish. You don’t have to use all the data in every plot! It’s fine to filter down to one country or a small handful of countries.

1. A scatterplot of two quantitative variables:

  • The plot below suggests that in Asia life expectancy rose as population grew, whereas on most other continents life expectancy rose steadily while population stayed roughly constant.
gapminder %>%
    mutate(million = pop / 1e6) %>%
    ggplot(aes(x=million,y=lifeExp)) +
    geom_point(aes(shape = continent, colour = year)) +
    ggtitle("Life Expectancy vs Population") +
    xlab("Population (millions)") +
    ylab("Life Expectancy (years)")

2. One other plot besides a scatterplot:

  • The plot below shows that life expectancy in North America has been increasing since 1952. Mexico in particular has experienced the greatest increase (a much steeper slope in the graph). Canada has a higher life expectancy than the United States, but their rates of increase are very similar.
gapminder %>%
    filter(country == "Canada" | country == "United States" | country == "Mexico") %>%
    ggplot(aes(x = year, y = lifeExp)) +
    geom_line(aes(colour = country)) +
    ggtitle("Life Expectancy in North America") +
    xlab("Year") + 
    ylab("Life Expectancy (years)")

Recycling (Optional)

Evaluate this code and describe the result. Presumably the analyst’s intent was to get the data for Rwanda and Afghanistan. Did they succeed? Why or why not? If not, what is the correct way to do this?

filter(gapminder, country == c("Rwanda", "Afghanistan"))
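# The Analyst's Way: == recycles c("Rwanda", "Afghanistan") down the rows,
# so only every other row of each country matches (12 rows instead of 24).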
filter(gapminder, country == c("Rwanda", "Afghanistan")) %>% 
  nrow()
## [1] 12
# The Correct Way: test every row against both country names.
filter(gapminder, country == "Rwanda" | country == "Afghanistan") %>%
  nrow()
## [1] 24
nrow(gapminder)
## [1] 1704
head(gapminder) %>%
    filter(country == c("Rwanda", "Afghanistan")) %>%
    kable()
country continent year lifeExp pop gdpPercap
Afghanistan Asia 1957 30.332 9240934 820.8530
Afghanistan Asia 1967 34.020 11537966 836.1971
Afghanistan Asia 1977 38.438 14880372 786.1134
table(
    c(
        "Afghanistan" == "Rwanda",
        "Afghanistan" == "Afghanistan",
        "Afghanistan" == "Rwanda",
        "Afghanistan" == "Afghanistan",
        "Afghanistan" == "Rwanda",
        "Afghanistan" == "Afghanistan"
    )
) %>% 
    kable(col.names= c("Boolean", "Frequency"))
Boolean Frequency
FALSE 3
TRUE 3
table(
    c(
        "Afghanistan" == c("Rwanda", "Afghanistan"),
        "Afghanistan" == c("Rwanda", "Afghanistan"),
        "Afghanistan" == c("Rwanda", "Afghanistan"),
        "Afghanistan" == c("Rwanda", "Afghanistan"),
        "Afghanistan" == c("Rwanda", "Afghanistan"),
        "Afghanistan" == c("Rwanda", "Afghanistan")
    )
) %>% 
    kable(col.names= c("Boolean", "Frequency"))
Boolean Frequency
FALSE 6
TRUE 6
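
The tables above can be condensed into a small base R sketch. The vector country below is a made-up stand-in for the gapminder column, and %in% is offered as an alternative to chaining == with |.

```r
# Hypothetical toy column: four rows per country.
country <- rep(c("Afghanistan", "Rwanda"), each = 4)
target  <- c("Rwanda", "Afghanistan")

# == recycles `target` to the length of `country`, so row 1 is compared
# with "Rwanda", row 2 with "Afghanistan", row 3 with "Rwanda", and so on.
recycled <- country == target

# %in% asks, for each row, whether its value appears anywhere in `target`.
membership <- country %in% target

sum(recycled)    # rows kept by the recycled comparison
sum(membership)  # rows kept by the membership test
```

The recycled comparison keeps only half of the rows, which mirrors the 12 of 24 rows returned for gapminder, while the membership test keeps them all. In a pipeline this would read filter(gapminder, country %in% c("Rwanda", "Afghanistan")).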

Tibble display

Present numerical tables in a more attractive form using knitr::kable() for small tibbles (say, up to 10 rows), and DT::datatable() for larger tibbles.

kable(head(gapminder))
country continent year lifeExp pop gdpPercap
Afghanistan Asia 1952 28.801 8425333 779.4453
Afghanistan Asia 1957 30.332 9240934 820.8530
Afghanistan Asia 1962 31.997 10267083 853.1007
Afghanistan Asia 1967 34.020 11537966 836.1971
Afghanistan Asia 1972 36.088 13079460 739.9811
Afghanistan Asia 1977 38.438 14880372 786.1134
datatable(gapminder)
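
The rule of thumb above could be wrapped in a small helper. This is a sketch assuming the knitr and DT packages are installed; show_table() and its max_static_rows argument are hypothetical names, not part of either package.

```r
library(knitr)
library(DT)

# Hypothetical helper: static kable() for small tables,
# interactive datatable() for anything larger.
show_table <- function(df, max_static_rows = 10) {
  if (nrow(df) <= max_static_rows) {
    kable(df)
  } else {
    datatable(df)
  }
}

small <- show_table(head(mtcars))  # 6 rows, rendered with kable()
large <- show_table(mtcars)        # 32 rows, rendered with datatable()
```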