HW05: Factor and Figure Management

Setup
Exercise 1: Explain the value of the here::here package
Exercise 2: Factor management
Exercise 3: File input/output (I/O)
Exercise 4: Visualization design
Exercise 5: Writing figures to file

Setup

suppressMessages(library(gapminder))
suppressMessages(library(tidyverse))
suppressMessages(library(DT))
suppressMessages(library(knitr))
suppressMessages(library(gridExtra))
suppressMessages(library(here))

Exercise 1: Explain the value of the `here::here` package

Read through the blog post by Malcom Barrett where he outlines why one should use the here::here package in RStudio projects.

Task: In your own words, summarize the value of the `here::here` package in 250 words or fewer.

The here::here package is very valuable for many reasons. One of the biggest reasons is that it maintains compatibility in directory paths across operating systems, where Unix systems use a / but Windows uses \ for the directory hierarchy. Another reason that the here::here package is valuable is its ability to select a directory that acts as a root directory. By using here() you can bypass the relative path system and always provide an absolute path from the root. Furthermore, the root itself will be automatically set to the directory containing an .Rproj file by default, but can be manually set using set_here(), which creates a .here file in the directory being set as the root. The use of the here::here package resolves many conflicts/problems that could arise from hardcoding absolute or relative directory paths when sharing R scripts. This way, collaborators will never have to change hardcoded aspects of the code and can just run the R scripts just as they were received, removing a lot of unnecessary hassle that comes with project collaboration with multiple collaborators. At the same time, using the here::here package increases reproducibility in data analysis, and increases accessibility for script end-users who may not be programmers. To summarize, the here::here package not only facilitates many aspects of project collaboration, it also eases the dilemma of hardcoding absolute or relative paths in any R script, regardless of collaboration.

Exercise 2: Factor management

Task: Choose one dataset (of your choice) and a variable to explore. After ensuring the variable(s) you’re exploring are indeed factors, you should:

Drop factor / levels
Reorder levels based on knowledge from data

We’ve elaborated on these steps for the gapminder and singer data sets below.

Explore the effects of re-leveling a factor in a tibble by:

Comparing the results of arrange on the original and re-leveled factor.
Plotting a figure of before/after re-leveling the factor (make sure to assign the factor to an aesthetic of your choosing).

These explorations should involve the data, the factor levels, and at least two figures (before and after.

Elaboration for the gapminder data set

Drop Oceania. Filter the Gapminder data to remove observations associated with the continent of Oceania. Additionally, remove unused factor levels. Provide concrete information on the data before and after removing these rows and Oceania; address the number of rows and the levels of the affected factors.

Filter `gapminder` data to remove observations associated with the `continent` of Oceania.

After dropping Oceania, the dataframe has 1680 rows, and the continent column has 5 levels, with the 5 levels being the 5 continents, including Oceania.
```
no_oceania <- gapminder %>%
  filter(continent != "Oceania")

nrow(no_oceania)
```
```
## [1] 1680
```
```
nlevels(no_oceania$continent)
```
```
## [1] 5
```
```
levels(no_oceania$continent)
```
```
## [1] "Africa"   "Americas" "Asia"     "Europe"   "Oceania"
```

Additionally, remove unused factor levels.

After removing unused factors, the number of rows remains the same as after dropping Oceania, but the levels have decreased to 4, now excluding Oceania from the levels.

no_oceania_dropped <- no_oceania %>%
  droplevels()

nrow(no_oceania_dropped)

## [1] 1680

nlevels(no_oceania_dropped$continent)

## [1] 4

levels(no_oceania_dropped$continent)

## [1] "Africa"   "Americas" "Asia"     "Europe"

Provide concrete information on the data before and after removing these rows and Oceania; address the number of rows and the levels of the affected factors.

Gapminder initially had 1704 rows and the continent column initially had 5 levels, where the levels were the 5 continents.

nrow(gapminder)

## [1] 1704

nlevels(gapminder$continent)

## [1] 5

levels(gapminder$continent)

## [1] "Africa"   "Americas" "Asia"     "Europe"   "Oceania"

Here is a summary table to address the number of rows and the levels of the affected factors:

Data	Gapminder	Oceania Dropped	Unused Factors Dropped
number of rows	1704	1680	1680
number of levels	5	4	4
levels	Africa, Americas, Asia, Europe, Oceania	Africa, Americas, Asia, Europe, Oceania	Africa, Americas, Asia, Europe

The number of levels (and the levels themselves) only decreased when droplevels() was used, but the number of rows decreased when filter() was used.

Reorder the levels of country or continent. Use the forcats package to change the order of the factor levels, based on summarized information of one of the quantitative variables. Consider experimenting with a summary statistic beyond the most basic choice of the mean/median. Use the forcats package in the tidyverse for this, rather than the baseR function as.factor.

Before releveling:

plot <- gapminder %>%
  group_by(continent) %>%
  summarise(var_lifeExp = var(lifeExp)) %>%
  ggplot() +
  geom_col(aes(continent,var_lifeExp)) +
  theme_bw() +
  ylab("Variance of\nLife Expectancy") +
  xlab("Continent") +
  coord_flip() +
  ggtitle("Figure 1")

Using fct_reorder():

fct_plot <- gapminder %>%
  group_by(continent) %>%
  summarise(var_lifeExp = var(lifeExp)) %>%
  ggplot() +
  geom_col(aes(fct_reorder(continent, var_lifeExp),var_lifeExp)) +
  theme_bw() +
  ylab("Variance of\nLife Expectancy") +
  xlab("Continent") +
  coord_flip() +
  ggtitle("Figure 2")

Using arrange():

arrange_plot <- gapminder %>%
  group_by(continent) %>%
  summarise(var_lifeExp = var(lifeExp)) %>%
  arrange(desc(var_lifeExp), .by_group = TRUE) %>%
  ggplot() +
  geom_col(aes(continent,var_lifeExp)) +
  theme_bw() +
  ylab("Variance of\nLife Expectancy") +
  xlab("Continent") +
  coord_flip() +
  ggtitle("Figure 3")

As you can see, arrange() does not relevel the factors like fct_reorder() does, evidenced by no change in the “before releveling” plot and the “using arrange()” plot (i.e. Figure 1 and Figure 3 are identical). However, arrange() does reorder the dataframe itself, just not releveling the factors.
Let’s verify this by checking the dataframes themselves:
- Using fct_reorder():
```
fct <- gapminder %>%
  group_by(continent) %>%
  summarise(var_lifeExp = var(lifeExp))

fct$continent <- fct_reorder(fct$continent, fct$var_lifeExp)
```
  - It seems that, using fct_reorder() does not change the actual ordering of the var_lifeExp column when shown in its dataframe format:
  Dataframe Appearance: fct_reorder()
  
  continent var_lifeExp
  
  Africa 83.72635
  
  Americas 87.33067
  
  Asia 140.76711
  
  Europe 29.51942
  
  Oceania 14.40666
  - But, that it does change the levels, ordering from lowest to highest var_lifeExp:
```
levels(fct$continent)
```
```
## [1] "Oceania"  "Europe"   "Africa"   "Americas" "Asia"
```
  - As a final check, replot using the fct dataframe:
```
fct_df_plot <- ggplot(fct) +
  geom_col(aes(continent,var_lifeExp)) +
  theme_bw() +
  ylab("Variance of\nLife Expectancy") +
  xlab("Continent") +
  coord_flip() +
  ggtitle("Figure 4")
```
  - Figure 2 and Figure 4 are the same, further confirming that fct_reorder() relevels the factors.
- Using arrange():
```
arrange <- gapminder %>%
  group_by(continent) %>%
  summarise(var_lifeExp = var(lifeExp)) %>%
  arrange(var_lifeExp, .by_group = TRUE)
```
  - It seems that, using arrange() does change the actual ordering of the var_lifeExp column:
  Dataframe Appearance: Using arrange()
  
  continent var_lifeExp
  
  Oceania 14.40666
  
  Europe 29.51942
  
  Africa 83.72635
  
  Americas 87.33067
  
  Asia 140.76711
  - But that it does NOT change the levels, as they are still ordered alphabetically instead of by var_lifeExp:
```
levels(arrange$continent)
```
```
## [1] "Africa"   "Americas" "Asia"     "Europe"   "Oceania"
```
  - As a final check: replot using the arrange dataframe:
```
arrange_df_plot <- ggplot(arrange) +
  geom_col(aes(continent,var_lifeExp)) +
  theme_bw() +
  ylab("Variance of\nLife Expectancy") +
  xlab("Continent") +
  coord_flip() +
  ggtitle("Figure 5")
```
  - Figure 1, 3 and Figure 5 are the same, supporting the conclusion that arrange does not relevel the factors.

Dataframe Appearance: `fct_reorder()`
continent	var_lifeExp
Africa	83.72635
Americas	87.33067
Asia	140.76711
Europe	29.51942
Oceania	14.40666

Dataframe Appearance: Using `arrange()`
continent	var_lifeExp
Oceania	14.40666
Europe	29.51942
Africa	83.72635
Americas	87.33067
Asia	140.76711

Conclusion: fct_reorder relevels the factors, but does not change the appearance of its dataframe, while arrange() changes the appearance of its dataframe, but does not relevel the factors.

Here is a summary table:

Figure	Function	Description	Outcome	Equal to
1	NA	This is the original plot with no attempt at releveling.	Factors are ordered alphabetically by default	Figure 3, 5
2	`fct_reorder()`	This is the plot where `fct_reorder()` was used as an `aes` in `ggplot()`	Factors are ordered by `var_lifeExp`	Figure 4
3	`arrange()`	This is the plot where `arrange()` was used on the dataframe before piping into `ggplot()`	Factors are still ordered alphabetically by default	Figure 1, 5
4	`fct_reorder()`	This is the plot where `fct_reorder()` was used on the factor in the dataframe, and then that dataframe was plotted	Factors are ordered by `var_lifeExp`	Figure 2
5	`arrange()`	This is the plot where `arrange()` was used on the dataframe and then that dataframe was plotted	Factors are still ordered alphabetically by default	Figure 1, 3

Here is a summary figure:
Looks like Oceania has the lowest variance in its life expectancy, while Asia has the highest variance.

Exercise 3: File input/output (I/O)

You are expected to create something new, probably by filtering or grouped-summarization of your dataset (for e.g., Singer, Gapminder, or another dataset), export it to disk and then reload it back in using one of the packages above. You should use here::here() for reading in and writing out.

With the imported data, play around with factor levels and use factors to order your data with one of your factors (i.e. non-alphabetically). For the I/O method(s) you chose, comment on whether or not your newly created file survived the round trip of writing to file then reading back in.

Write the data out to avg_gapminder.csv:

library(here)
csv_out <- gapminder %>%
  group_by(continent) %>%
  summarize(avglifeExp = mean(lifeExp), avgGDP = mean(gdpPercap), avgPop = mean(pop))

write_csv(csv_out, here::here("hw05", "avg_gapminder.csv"), col_names = TRUE)

What csv_out looks like:

`csv_out`: Dataframe written out to `avg_gapminder.csv`
continent	avglifeExp	avgGDP	avgPop
Africa	48.86533	2193.755	9916003
Americas	64.65874	7136.110	24504795
Asia	60.06490	7902.150	77038722
Europe	71.90369	14469.476	17169765
Oceania	74.32621	18621.609	8874672

Read the data back into csv_in:

csv_in <- read_csv(here::here("hw05", "avg_gapminder.csv"), col_names = TRUE)

## Parsed with column specification:
## cols(
##   continent = col_character(),
##   avglifeExp = col_double(),
##   avgGDP = col_double(),
##   avgPop = col_double()
## )

What csv_in looks like:

`csv_in`: Dataframe read in from `avg_gapminder.csv`
continent	avglifeExp	avgGDP	avgPop
Africa	48.86533	2193.755	9916003
Americas	64.65874	7136.110	24504795
Asia	60.06490	7902.150	77038722
Europe	71.90369	14469.476	17169765
Oceania	74.32621	18621.609	8874672

In order to play with factors, continent must be a factor first. Using the code above to read_csv(), the resulting dataframe does not actually contain factors! The continent column has been imported as a character class instead of factor:

str(csv_in)

## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 5 obs. of  4 variables:
##  $ continent : chr  "Africa" "Americas" "Asia" "Europe" ...
##  $ avglifeExp: num  48.9 64.7 60.1 71.9 74.3
##  $ avgGDP    : num  2194 7136 7902 14469 18622
##  $ avgPop    : num  9916003 24504795 77038722 17169765 8874672
##  - attr(*, "spec")=
##   .. cols(
##   ..   continent = col_character(),
##   ..   avglifeExp = col_double(),
##   ..   avgGDP = col_double(),
##   ..   avgPop = col_double()
##   .. )

To relevel factors, we must make sure continent is a factor. There are two ways to do this:

Adding additional parameters to read_csv():

csv_in <- read_csv(
  here::here("hw05", "avg_gapminder.csv"), 
  col_names = TRUE, 
  cols(
    continent = col_factor(),
    avglifeExp = col_double(),
    avgGDP = col_double(),
    avgPop = col_double()
  )
)
levels(csv_in$continent)

## [1] "Africa"   "Americas" "Asia"     "Europe"   "Oceania"

Using factor():

csv_in$continent <- factor(csv_in$continent)
levels(csv_in$continent)

## [1] "Africa"   "Americas" "Asia"     "Europe"   "Oceania"

Releveling (by avgPop):

csv_in$continent <- fct_reorder(csv_in$continent, csv_in$avgPop)
levels(csv_in$continent)

## [1] "Oceania"  "Africa"   "Europe"   "Americas" "Asia"

Check that releveling actually reordered the factors in order of increasing avgPop:

releveled <- csv_in %>%
  arrange(avgPop)

Using `arrange()` to check if factors releveled correctly
continent	avglifeExp	avgGDP	avgPop
Oceania	74.32621	18621.609	8874672
Africa	48.86533	2193.755	9916003
Europe	71.90369	14469.476	17169765
Americas	64.65874	7136.110	24504795
Asia	60.06490	7902.150	77038722

Since the order of continent in the table above matches the order of levels(csv_in$continent), releveling was successful!

Since csv_in and csv_out look identical to one another (with or without factor releveling), the newly created file avg_gapminder.csv DID survive the round trip of writing to file then reading back in.

Exercise 4: Visualization design

Go back through your previous assignments and class participation activities and find figures you created prior to the last week of the course. Recreate at least one figure in light of something you learned in the recent class meetings about visualization design and color.

Task: Create a side-by-side plot and juxtapose your first attempt (show the original figure as-is) with a revised attempt after some time spent working on it and implementing principles of effective plotting principles. Comment and reflect on the differences.

Before plot (I was given feedback from one of the TAs that it was too busy/crowded):

before_plot <- gapminder %>%
  mutate(million=pop/(10**6)) %>%
  ggplot(aes(x=million,y=lifeExp)) +
  geom_point(aes(shape = continent, colour = year)) +
  ggtitle("Life Expectancy vs Population") +
  xlab("Population (millions)") +
  ylab("Life Expectancy (years)")

After plot:

data <- gapminder %>%
  mutate(pop=pop/(10**6)) %>%
  filter(country == "Canada" | country == "United States" | country == "Mexico") %>%
  select(c(-gdpPercap)) %>% 
  pivot_longer(cols = c(lifeExp,pop),
               names_to = "key", 
               values_to = "value")

data$country <- fct_relevel(data$country,"Canada", "United States", "Mexico") %>%
  droplevels()

after_plot <- data %>%
  ggplot(aes(year,value)) +
  geom_line(aes(colour = country)) + 
  geom_point(aes(colour = country)) +
  facet_wrap(~ key, 
             ncol = 1, 
             scale = "free_y",
             labeller = labeller(key = c(lifeExp = "Life Expectancy (years)", 
                                         pop = "Population (millions)"))) +
  ggtitle("Trends in North America") +
  xlab("year") +
  ylab("") +
  theme_bw() +
  theme(text = element_text(size = 14), 
        panel.grid.major.x = element_blank(), 
        panel.grid.minor.x = element_blank(),
        legend.position="bottom"
  )

Combine the two side by side:

combined <- grid.arrange(before_plot,after_plot, ncol=2)

## TableGrob (1 x 2) "arrange": 2 grobs
##   z     cells    name           grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]

Commenting on the differences:
- Differences in the data:
  - pivoted the data longer to make use of faceting
  - subsetted the dataset to North American countries to resolve overcrowding
  - factors releveled to reflect North-South geographic location order
- Differences in the plot:
  - changed title to a less redundant/repetitive title
  - increased overall font size
  - added lines to the points
  - changed facet labels to replace y-axis labels
    - removed now redundant y-axis labels
  - removed vertical grid lines
  - removed grey background colour
  - instead of lifeExp vs pop, it is now lifeExp vs year and pop vs year
Reflecting on the differences:
- The trends are actually visible in the new plot
- In the old plot, with the colour scheme and the shapes, it was impossible to actually understand what the plot was trying to convey
- In the new plot, the subsampling has allowed for a clear message to be conveyed
- The faceting has also allowed for both trends to be seen simultaneously more easily
- The lines in the plot also allude to the trends to be observed
- The removed background and grid lines makes the plot less distracting

Exercise 5: Writing figures to file

Task: Use `ggsave()` to explicitly save a plot to file. Include the exported plot as part of your repository and assignment.

Then, use ![Alt text](/path/to/img.png) to load and embed that file into your report. You can play around with various options, such as:

Arguments of ggsave(), such as width, height, resolution or text scaling.
Various graphics devices, e.g. a vector vs. raster format.
Explicit provision of the plot object p via ggsave(..., plot = p). Show a situation in which this actually matters.

By default, ggsave() saves the most recently plot made using ggplot(). With the code below, ggsave() will save after_plot (the most recent plot) to combined.png. In this case, since combined (the plot we want to save) was not plotted directly using ggplot(), it is not the plot ggsave() chooses to save. Instead ggsave() chooses to save after_plot. Specifying plot = matters in this case.

ggsave(here::here("hw05","combined.png"), width = 16, height = 9, dpi = 300, device = "png")

To save the combined plot, plot = combined must be specified explicitly:

ggsave(here::here("hw05","combined.png"), plot = combined, width = 16, height = 9, dpi = 300, device = "png")

Using ![Alt text](/path/to/img.png) to load and embed that file into your report: Before and After Effective Visualization