Assignment 2

Set Up
Exercise 1: Basic dplyr
Exercise 2: Explore individual variables with dplyr
Exercise 3: Explore various plot types
Recycling (Optional)
Tibble display

Set Up

Load the necessary packages for this assignment:

library(gapminder)
library(tidyverse)
library(knitr)
library(DT)

Exercise 1: Basic `dplyr`

Exercise 1.1

Use filter() to subset the gapminder data to three countries of your choice in the 1970s.

gapminder %>%
  filter(year >= 1970 & 
           year < 1980, 
         country == "Canada" | 
           country == "China" | 
           country == "Japan") %>%
  kable()

country	continent	year	lifeExp	pop	gdpPercap
Canada	Americas	1972	72.88000	22284500	18970.5709
Canada	Americas	1977	74.21000	23796400	22090.8831
China	Asia	1972	63.11888	862030000	676.9001
China	Asia	1977	63.96736	943455000	741.2375
Japan	Asia	1972	73.42000	107188273	14778.7864
Japan	Asia	1977	75.38000	113872473	16610.3770
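
The same filter can be written more compactly. The sketch below is one possible alternative, not something the exercise requires. It assumes dplyr's between(), which includes both endpoints, and base R's %in%; the name `seventies` is made up for illustration. It returns the same six rows as the version above.

```r
library(gapminder)
library(dplyr)

# between() includes both endpoints, so 1970..1979 covers the 1970s;
# %in% tests each row's country for membership in the set.
seventies <- gapminder %>%
  filter(between(year, 1970, 1979),
         country %in% c("Canada", "China", "Japan"))

nrow(seventies)
```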

Exercise 1.2

Use the pipe operator %>% to select country and gdpPercap from your filtered dataset in 1.1.

gapminder %>%
  filter(year >= 1970 & 
           year < 1980, 
         country == "Canada" | 
           country == "China" | 
           country == "Japan") %>%
  select(country, gdpPercap) %>%
  kable()

country	gdpPercap
Canada	18970.5709
Canada	22090.8831
China	676.9001
China	741.2375
Japan	14778.7864
Japan	16610.3770

Exercise 1.3

Filter gapminder to all entries that have experienced a drop in life expectancy. Be sure to include a new variable that’s the increase in life expectancy in your tibble. Hint: you might find the lag() or diff() functions useful.

gapminder %>%
  arrange(year) %>%
  group_by(country) %>%
  mutate(lifeExpInc = lifeExp - lag(lifeExp)) %>%
  drop_na() %>%
  filter(lifeExpInc < 0) %>%
  datatable()
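
To see why lag() suits this exercise, here is a small sketch on a made-up vector (the values in `x` are illustrative, not taken from gapminder). dplyr::lag() shifts a vector by one position, so `x - lag(x)` gives the change from the previous entry, with NA in the first position. Base R's diff() returns the same changes without the leading NA.

```r
library(dplyr)

x <- c(70, 72, 71)  # hypothetical life expectancies from three consecutive surveys

change_lag  <- x - dplyr::lag(x)  # NA first, then the change from the previous value
change_diff <- diff(x)            # the same changes, one element shorter

change_lag   # NA  2 -1
change_diff  #  2 -1
```

Because the lag is taken over whatever vector it is given, grouping by country first keeps one country's first year from being compared with a different country's last year.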

Exercise 1.4

Filter gapminder so that it shows the max GDP per capita experienced by each country. Hint: you might find the max() function useful here.

gapminder %>%
    group_by(country) %>% 
    summarize(maxGDP = max(gdpPercap)) %>%
    datatable()
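
summarize() keeps only the country and its maximum. If the year in which the maximum occurred is also of interest, one possible alternative (an assumption about what a reader might want, not part of the exercise) is dplyr's slice_max(), available in dplyr 1.0.0 and later, which keeps the whole row. The name `peak_gdp` is made up for illustration.

```r
library(gapminder)
library(dplyr)

peak_gdp <- gapminder %>%
  group_by(country) %>%
  # keep the single row with the largest gdpPercap within each country
  slice_max(gdpPercap, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(country, year, maxGDP = gdpPercap)

nrow(peak_gdp)  # one row per country
```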

Exercise 1.5

Produce a scatterplot of Canada’s life expectancy vs. GDP per capita using ggplot2, without defining a new variable. That is, after filtering the gapminder data set, pipe it directly into the ggplot() function. Ensure GDP per capita is on a log scale.

gapminder %>%
    filter(country == "Canada") %>%
    select(lifeExp, gdpPercap) %>%
    ggplot(aes(gdpPercap, lifeExp)) +
    geom_point() + 
    scale_x_log10() +
    ggtitle("Canada's Life Expectancy vs GDP per Capita") + 
    ylab("Life Expectancy (years)") +
    xlab("GDP per Capita")

Exercise 2: Explore individual variables with `dplyr`

Pick one categorical variable and one quantitative variable to explore. Answer the following questions in whichever way you think is appropriate, using dplyr:

What are possible values (or range, whichever is appropriate) of each variable?

Categorical variable: country
The possible values of a categorical variable are its categories themselves. Rather than a range, we can report how many distinct categories there are.
There are 142 countries in this dataset.

gapminder %>%
    select(country) %>%
    unique() %>% 
    nrow()

## [1] 142

Quantitative variable: lifeExp
Possible values are any positive numbers (realistically <= 100). In the case of lifeExp, a range of 0 to the maximum life expectancy in the dataset (rounded to a whole number) would be an appropriate range.
The highest life expectancy in this dataset is 83 years (rounded to a whole number).

gapminder %>%
    select(lifeExp) %>%
    max() %>%
    round()

## [1] 83
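
Both answers can also be read off a single summarise() call. This is a sketch using dplyr's n_distinct() together with min() and max(); the column names in the result are made up for illustration.

```r
library(gapminder)
library(dplyr)

var_summary <- gapminder %>%
  summarise(
    n_countries = n_distinct(country),  # number of distinct categories
    min_lifeExp = min(lifeExp),         # lower end of the observed range
    max_lifeExp = max(lifeExp)          # upper end of the observed range
  )

var_summary
```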

What values are typical? What’s the spread? What’s the distribution? Etc., tailored to the variable at hand. Feel free to use summary stats, tables, figures.

Typical Values

Life expectancy could in principle be anywhere from 0 to about 100 years, but in this dataset it runs from 23.60 to 82.60 years. Typical values sit near the median of 60.71 years (the mean is 59.47 years).
A typical value for country is any of the 142 countries included in this dataset.

# Find the number of countries in this dataset:
gapminder %>%
  select(country) %>%
  unique() %>%
  nrow()

## [1] 142

summary(gapminder) %>%
    kable()

country	continent	year	lifeExp	pop	gdpPercap
Afghanistan: 12	Africa :624	Min. :1952	Min. :23.60	Min. :6.001e+04	Min. : 241.2
Albania : 12	Americas:300	1st Qu.:1966	1st Qu.:48.20	1st Qu.:2.794e+06	1st Qu.: 1202.1
Algeria : 12	Asia :396	Median :1980	Median :60.71	Median :7.024e+06	Median : 3531.8
Angola : 12	Europe :360	Mean :1980	Mean :59.47	Mean :2.960e+07	Mean : 7215.3
Argentina : 12	Oceania : 24	3rd Qu.:1993	3rd Qu.:70.85	3rd Qu.:1.959e+07	3rd Qu.: 9325.5
Australia : 12	NA	Max. :2007	Max. :82.60	Max. :1.319e+09	Max. :113523.1
(Other) :1632	NA	NA	NA	NA	NA

Spread: Range

Range does not apply to categorical variables (here, country).
Range only applies to quantitative variables (here, lifeExp).
The range is defined as the difference between the highest and lowest values.
The lowest life expectancy is 23.599 years, and the highest life expectancy is 82.603 years.
The range is 59.004 years.

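# Named lifeExp_range so the result does not mask the base function range()
lifeExp_range <- gapminder %>%
    select(lifeExp) %>%
    range()

lifeExp_range[2] - lifeExp_range[1]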

## [1] 59.004

Spread: Interquartile Range

Interquartile range does not apply to categorical variables (here, country).
Interquartile range only applies to quantitative variables (here, lifeExp).
The first quartile is 48.20 years, the second quartile (the median) is 60.71 years, and the third quartile is 70.85 years.
The interquartile range is the third minus the first quartile (which can also be calculated using the function IQR()):

select(gapminder, lifeExp) %>%
    summary() %>%
    kable()

	lifeExp
	Min. :23.60
	1st Qu.:48.20
	Median :60.71
	Mean :59.47
	3rd Qu.:70.85
	Max. :82.60

(thirdq <- summary(gapminder$lifeExp)["3rd Qu."])

## 3rd Qu. 
## 70.8455

(firstq <- summary(gapminder$lifeExp)["1st Qu."])

## 1st Qu. 
##  48.198

unname(thirdq - firstq)

## [1] 22.6475

IQR(gapminder$lifeExp)

## [1] 22.6475

The interquartile range is 22.6475 years.
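
The quartiles can also be pulled out directly with quantile(), which avoids indexing into the output of summary() by label. A minimal sketch in base R (`q` and `iqr_manual` are illustrative names):

```r
library(gapminder)

# 25th and 75th percentiles of life expectancy
q <- quantile(gapminder$lifeExp, probs = c(0.25, 0.75))

# third quartile minus first quartile; this is what IQR() returns
iqr_manual <- unname(q[2] - q[1])
iqr_manual
```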

Spread: Variance

The variance can be calculated using the function var().
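The variance in life expectancy is 166.8517 years squared (variance is measured in squared units), which corresponds to a standard deviation of about 12.9 years.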

gapminder %>%
  select(lifeExp) %>%
  var()

##          lifeExp
## lifeExp 166.8517
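
As a check on what var() computes: the sample variance is the sum of squared deviations from the mean divided by n - 1, and its square root is the standard deviation, which is back in the original units (years). A sketch in base R, with illustrative variable names:

```r
library(gapminder)

x <- gapminder$lifeExp
n <- length(x)

# sample variance: squared deviations from the mean, divided by n - 1
var_manual <- sum((x - mean(x))^2) / (n - 1)

# standard deviation, in years rather than years squared
sd_manual <- sqrt(var_manual)

c(variance = var_manual, sd = sd_manual)
```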

Distribution

The distribution of life expectancy can be visualized using a histogram.

ggplot(gapminder, aes(lifeExp)) +
  geom_histogram(bins=100) + 
  ggtitle("Distribution of Life Expectancy") +
  xlab("Life Expectancy") +
  ylab("Count")

The distribution of country can be determined using count().
In this case, there are 12 data points (i.e. rows) per country in this dataset.

gapminder %>% 
  count(country) %>%
  datatable()
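
Base R's table() produces the same per-country counts. The sketch below uses it to confirm that every country appears the same number of times (`country_counts` is an illustrative name).

```r
library(gapminder)

# number of rows for each country
country_counts <- table(gapminder$country)

# the distinct row counts that occur; a single value means
# every country has the same number of rows
unique(as.vector(country_counts))
```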

Exercise 3: Explore various plot types

Make two plots that have some value to them. That is, plots that someone might actually consider making for an analysis. Just don’t make the same plots we made in class – feel free to use a data set from the datasets R package if you wish. You don’t have to use all the data in every plot! It’s fine to filter down to one country or a small handful of countries.

1. A scatterplot of two quantitative variables:

The plot below suggests that in Asia life expectancy rose as the population grew, whereas on most other continents life expectancy rose steadily while the population stayed roughly the same.

gapminder %>%
    mutate(million = pop / 1e6) %>%
    ggplot(aes(x=million,y=lifeExp)) +
    geom_point(aes(shape = continent, colour = year)) +
    ggtitle("Life Expectancy vs Population") +
    xlab("Population (millions)") +
    ylab("Life Expectancy (years)")

2. One other plot besides a scatterplot:

The plot below shows that life expectancy in North America has been increasing since 1952. Mexico in particular has experienced the greatest increase (a much steeper slope in the graph). Canada has a higher life expectancy than the United States, but their rates of increase are very similar.

gapminder %>%
    filter(country == "Canada" | country == "United States" | country == "Mexico") %>%
    ggplot(aes(x = year, y = lifeExp)) +
    geom_line(aes(colour = country)) +
    ggtitle("Life Expectancy in North America") +
    xlab("Year") + 
    ylab("Life Expectancy (years)")

Recycling (Optional)

Evaluate this code and describe the result. Presumably the analyst’s intent was to get the data for Rwanda and Afghanistan. Did they succeed? Why or why not? If not, what is the correct way to do this?

filter(gapminder, country == c("Rwanda", "Afghanistan"))

# The Analyst's Way:
filter(gapminder, country == c("Rwanda", "Afghanistan")) %>% 
  nrow()

## [1] 12

# The Correct Way:
filter(gapminder, country == "Rwanda" | country == "Afghanistan") %>%
  nrow()

## [1] 24

nrow(gapminder)

## [1] 1704

As shown above, the analyst’s way returns only half the data (12 rows, versus 24 for the correct method).
The analyst did NOT succeed, because they compared country against a vector using ==. This vector has length 2, whereas the gapminder dataset has 1704 rows, so the comparison is made element by element, with the shorter vector repeated as many times as needed.
This process is known as ‘recycling’: in essence, the 1704 rows are compared against 852 sequential copies of the length-2 vector, so odd-numbered rows are tested against "Rwanda" and even-numbered rows against "Afghanistan".
To demonstrate, I’ve subsampled the dataset to the first 6 rows using head(), and filtered using the analyst’s way. As you can see, only 3 of the 6 rows are returned (consistent with the result above, where 50% of the data is missing).

head(gapminder) %>%
    filter(country == c("Rwanda", "Afghanistan")) %>%
    kable()

country	continent	year	lifeExp	pop	gdpPercap
Afghanistan	Asia	1957	30.332	9240934	820.8530
Afghanistan	Asia	1967	34.020	11537966	836.1971
Afghanistan	Asia	1977	38.438	14880372	786.1134

This is because, for this reduced subsample, these are the exact comparisons being made, and only the rows where the comparison is TRUE are kept as matching the criteria.

table(
    c(
        "Afghanistan" == "Rwanda",
        "Afghanistan" == "Afghanistan",
        "Afghanistan" == "Rwanda",
        "Afghanistan" == "Afghanistan",
        "Afghanistan" == "Rwanda",
        "Afghanistan" == "Afghanistan"
    )
) %>% 
    kable(col.names= c("Boolean", "Frequency"))

Boolean	Frequency
FALSE	3
TRUE	3

This is the result of an error in logic. The analyst most likely assumed that the comparison being made at each row of this reduced subsample would be:

table(
    c(
        "Afghanistan" == c("Rwanda", "Afghanistan"),
        "Afghanistan" == c("Rwanda", "Afghanistan"),
        "Afghanistan" == c("Rwanda", "Afghanistan"),
        "Afghanistan" == c("Rwanda", "Afghanistan"),
        "Afghanistan" == c("Rwanda", "Afghanistan"),
        "Afghanistan" == c("Rwanda", "Afghanistan")
    )
) %>% 
    kable(col.names= c("Boolean", "Frequency"))

Boolean	Frequency
FALSE	6
TRUE	6

The analyst did not consider the recycling that R does when comparing vectors of different lengths.
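
The recycling can be reproduced on a small character vector, without gapminder. The sketch below uses made-up data and also shows %in%, a more idiomatic alternative to chaining == with | when matching against several values.

```r
country <- c("Afghanistan", "Afghanistan", "Rwanda", "Rwanda")
targets <- c("Rwanda", "Afghanistan")

# == recycles `targets`: element i is compared with targets[1] when i is odd
# and with targets[2] when i is even
recycled <- country == targets

# %in% asks, for each element, whether it appears anywhere in `targets`
membership <- country %in% targets

recycled    # FALSE  TRUE  TRUE FALSE
membership  #  TRUE  TRUE  TRUE  TRUE
```

Applied to the exercise, `filter(gapminder, country %in% c("Rwanda", "Afghanistan"))` returns the same 24 rows as the | version. Note that R only warns about recycling when the longer length is not a multiple of the shorter one; since 1704 is a multiple of 2, the analyst's code runs silently.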

Tibble display

Present numerical tables in a more attractive form using knitr::kable() for small tibbles (say, up to 10 rows), and DT::datatable() for larger tibbles.

See the exercises above as well.

kable(head(gapminder))

country	continent	year	lifeExp	pop	gdpPercap
Afghanistan	Asia	1952	28.801	8425333	779.4453
Afghanistan	Asia	1957	30.332	9240934	820.8530
Afghanistan	Asia	1962	31.997	10267083	853.1007
Afghanistan	Asia	1967	34.020	11537966	836.1971
Afghanistan	Asia	1972	36.088	13079460	739.9811
Afghanistan	Asia	1977	38.438	14880372	786.1134

datatable(gapminder)
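
The rule of thumb above can be wrapped in a small helper. This is a hypothetical sketch: `show_table()` and its `max_rows` argument are names invented here, not part of knitr or DT, and the built-in mtcars data is used so that the example does not depend on gapminder.

```r
library(knitr)
library(DT)

# Hypothetical helper: static kable() for small tables,
# interactive datatable() for anything longer
show_table <- function(df, max_rows = 10) {
  if (nrow(df) <= max_rows) {
    kable(df)
  } else {
    datatable(df)
  }
}

small <- show_table(head(mtcars))  # 6 rows, rendered with kable()
large <- show_table(mtcars)        # 32 rows, rendered with datatable()
```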