Assignment 04: Tidy data and joins

Set Up
Exercise 1: Univariate Data Reshaping
- Univariate Option 1
Exercise 2: Multivariate Data Reshaping
- Multivariate Option 1
Exercise 3: Table Joins

Set Up

Load required packages:

suppressMessages(library(tidyverse))
suppressMessages(library(gapminder))
suppressMessages(library(DT))

Exercise 1: Univariate Data Reshaping

Choose EITHER “Univariate Option 1” or “Univariate Option 2”. Both of these problems have three components:

Putting data in wider format;
Producing a plot of the wide data;
Re-lengthening the wider data.

You are expected to use pivot_wider() and pivot_longer() for reshaping, and ggplot2 for the plot.

Regarding the plot:

Don’t worry about plotting all of the data from the wide tibble.
The plot should be something that someone might actually want to produce in an analysis.

Univariate Option 1

1. Make a tibble with one row per year, and columns for life expectancy for two or more countries.

widened <- gapminder %>% 
  pivot_wider(id_cols = year, names_from = "country", values_from = lifeExp)

Each row in this table corresponds to a single year, and there is a column for each of the 142 countries containing the life expectancy in years.
There is a total of only 12 rows in this tibble, as there are only 12 years in the gapminder data set.

NOTE: this datatable can be scrolled horizontally to reveal more columns!

2. Take advantage of this new data shape to scatterplot life expectancy for one country against that of another.

widened_plot <- widened %>%
  ggplot(aes(Canada, China)) +
  geom_point(aes(color = as.factor(year))) +
  ggtitle("Life Expectancy from 1952 to 2007") +
  ylab("Life Expectancy in China (years)") +
  xlab("Life Expectancy in Canada (years)") +
  scale_color_discrete("Year") +
  theme(text = element_text(size=18))

China consistently has a lower life expectancy than Canada in every year.
Both countries have a steadily increasing life expectancy overall, except for China in 1962, where the China’s life expectancy dropped substantially.

3. Re-lengthen the data.

lengthened <- widened %>%
  pivot_longer(cols = -year, names_to = "country", values_to = "lifeExp")

The relengthened tibble will only contain year, country, lifeExp (since the other columns continent, gdpPercap and pop were dropped during the original pivot_wider()), with each combination of country and year in each row.

Exercise 2: Multivariate Data Reshaping

Choose EITHER “Multivariate Option 1” or “Multivariate Option 2”. All of these problems have two components:

Putting data in wider format;
Re-lengthening the data.

Don’t worry about producing a plot here. You are expected to use pivot_wider() and pivot_longer() for reshaping.

Multivariate Option 1

1. Make a tibble with one row per year, and columns for life expectancy and GDP per capita (or two other numeric variables) for two or more countries.

widened <- gapminder %>%
  pivot_wider(id_cols  = year, 
              names_from = country,
              names_sep = "_", 
              values_from = c("lifeExp", "gdpPercap"))

This widened tibble contains a column for lifeExp and gdpPercap of all 142 countries for each row, meaning a total of 285 columns.
This tibble has 12 rows, as that is the number of unique years in the gapminder dataset.

NOTE: this datatable can be scrolled horizontally to reveal more columns!

2. Re-lengthen the data.

lengthened <- widened %>%
  pivot_longer(cols = -year,
               names_to = c(".value", "country"),
               names_sep = "_")

The relengthened tibble below will only contain year, country, lifeExp and gdpPercap (since the other columns continent and pop were dropped during the original pivot_wider()), with each combination of country and year in each row.

Exercise 3: Table Joins

Do ALL of the activities in this section.

Read in the made-up wedding guestlist and email addresses using the following lines (go ahead and copy-paste these):

guest <- suppressMessages(
  read_csv("https://raw.githubusercontent.com/STAT545-UBC/Classroom/master/data/wedding/attend.csv")
  )

email <- suppressMessages(
  read_csv("https://raw.githubusercontent.com/STAT545-UBC/Classroom/master/data/wedding/emails.csv")
  )

Then, complete the following tasks using the tidyverse (tidyr, dplyr, …). No need to do any pivoting – feel free to leave guest in its current format.

Exercise 3.1

For each guest in the guestlist (`guest` tibble), add a column for email address, which can be found in the `email` tibble.

In order to join these two tibbles, the email tibble must be separated by guest as opposed to by party, using separate_rows().
Then, the guest column in email must be renamed to name, as it appears in the guest tibble, using rename().

email_sep <- email %>%
  separate_rows(email, guest, sep = ", ") %>%
  rename(name = guest)

Then, since additional information is to be added to the original tibble, we employ the use of left_join().

guest_email <- left_join(guest,
          email_sep,
          by = "name"
)

This datatable contains all guests on the guest list, and an email address if one was provided.

NOTE: this datatable can be scrolled horizontally to reveal more columns!

Exercise 3.2

Who do we have emails for, yet are not on the guestlist?

Using anti_join(), guests in email_sep who are also in guest will be dropped.

email_not_guest <- anti_join(email_sep,guest, by = "name")

There are only three people we have emails for that are not on the guest list.
Since the original email tibble was processed to be separated by name instead of party, it is possible that other members of the party are on the guest list but these three are not.
After double checking, it is revealed that these three below belong to two separate parties, and none of the other party members were on the original guest list either.

Exercise 3.3

Make a guestlist that includes everyone we have emails for (in addition to those on the original guestlist).

The tibble should combine all guests in the guest list, and all guests in email_sep, by using full_join().

all <- full_join(guest, email_sep, by = "name")

The three people with emails but aren’t on the guest list have been appended to the end of the tibble.
They have NA values for almost every column except for name and email.

Assignment 04: Tidy data and joins

Diana Lin

Set Up

Exercise 1: Univariate Data Reshaping

Univariate Option 1

1. Make a tibble with one row per year, and columns for life expectancy for two or more countries.

NOTE: this datatable can be scrolled horizontally to reveal more columns!

2. Take advantage of this new data shape to scatterplot life expectancy for one country against that of another.

3. Re-lengthen the data.

Exercise 2: Multivariate Data Reshaping

Multivariate Option 1

1. Make a tibble with one row per year, and columns for life expectancy and GDP per capita (or two other numeric variables) for two or more countries.

NOTE: this datatable can be scrolled horizontally to reveal more columns!

2. Re-lengthen the data.

Exercise 3: Table Joins

Exercise 3.1

For each guest in the guestlist (guest tibble), add a column for email address, which can be found in the email tibble.

NOTE: this datatable can be scrolled horizontally to reveal more columns!

Exercise 3.2

Who do we have emails for, yet are not on the guestlist?

Exercise 3.3

Make a guestlist that includes everyone we have emails for (in addition to those on the original guestlist).

NOTE: this datatable can be scrolled horizontally to reveal more columns!

For each guest in the guestlist (`guest` tibble), add a column for email address, which can be found in the `email` tibble.