suppressMessages(library(gapminder))
suppressMessages(library(tidyverse))
suppressMessages(library(DT))
suppressMessages(library(knitr))
suppressMessages(library(gridExtra))
suppressMessages(library(here))
`here::here` package
Read through the blog post by Malcolm Barrett where he outlines why one should use the `here::here` package in RStudio projects.
Summarize the value of the `here::here` package in 250 words or fewer.
The `here::here` package is valuable for several reasons. One of the biggest is that it keeps directory paths compatible across operating systems: Unix systems use `/` as the directory separator, whereas Windows uses `\`. Another is that it designates one directory as the project root. By using `here()`, you can bypass working-directory-dependent relative paths and always build an absolute path from that root. By default, the root is the directory containing an `.Rproj` file, but it can be set manually using `set_here()`, which creates a `.here` file in the directory chosen as the root. Using the `here::here` package avoids many of the problems that arise from hardcoding absolute or relative directory paths when sharing R scripts. Collaborators do not have to change hardcoded paths and can run the scripts as they were received, which removes a lot of unnecessary hassle from collaborative projects. At the same time, using the `here::here` package increases reproducibility in data analysis and makes scripts more accessible to end users who may not be programmers. To summarize, the `here::here` package not only facilitates project collaboration, it also removes the dilemma of hardcoding absolute or relative paths in any R script, with or without collaborators.
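As a minimal sketch of the idea above, the snippet below builds a path from the project root with `here::here()`. The folder and file names (`hw05`, `avg_gapminder.csv`) are borrowed from later in this report and stand in for any project file; the root that `here()` reports depends on where the code is run.

```r
library(here)

# Project root as detected by here (e.g. the folder containing an .Rproj or .here file)
root <- here::here()

# Path components are passed separately, so no "/" or "\" is hardcoded
csv_path <- here::here("hw05", "avg_gapminder.csv")

print(root)
print(csv_path)
```

Because the separator is never typed by hand, the same call should work unchanged on Windows and on Unix-like systems.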
We’ve elaborated on these steps for the gapminder and singer data sets below.
Explore the effects of re-leveling a factor in a tibble by using `arrange` on the original and re-leveled factor.
These explorations should involve the data, the factor levels, and at least two figures (before and after).
Drop Oceania. Filter the Gapminder data to remove observations associated with the `continent` of Oceania. Additionally, remove unused factor levels. Provide concrete information on the data before and after removing these rows and Oceania; address the number of rows and the levels of the affected factors.
After filtering the `gapminder` data to remove observations associated with the `continent` of Oceania, the dataframe has 1680 rows, and the `continent` column still has 5 levels (the 5 continents, including Oceania), because `filter()` removes rows but leaves factor levels untouched.
no_oceania <- gapminder %>%
filter(continent != "Oceania")
nrow(no_oceania)
## [1] 1680
nlevels(no_oceania$continent)
## [1] 5
levels(no_oceania$continent)
## [1] "Africa" "Americas" "Asia" "Europe" "Oceania"
After removing unused factor levels with `droplevels()`, the number of rows is unchanged, but the number of levels has decreased to 4: Oceania is no longer a level.
no_oceania_dropped <- no_oceania %>%
droplevels()
nrow(no_oceania_dropped)
## [1] 1680
nlevels(no_oceania_dropped$continent)
## [1] 4
levels(no_oceania_dropped$continent)
## [1] "Africa" "Americas" "Asia" "Europe"
Gapminder initially had 1704 rows, and the `continent` column initially had 5 levels, one for each of the 5 continents.
nrow(gapminder)
## [1] 1704
nlevels(gapminder$continent)
## [1] 5
levels(gapminder$continent)
## [1] "Africa" "Americas" "Asia" "Europe" "Oceania"
Here is a summary table to address the number of rows and the levels of the affected factors:
Data | Gapminder | Oceania Dropped | Unused Factors Dropped |
---|---|---|---|
number of rows | 1704 | 1680 | 1680 |
number of levels | 5 | 5 | 4 |
levels | Africa, Americas, Asia, Europe, Oceania | Africa, Americas, Asia, Europe, Oceania | Africa, Americas, Asia, Europe |
The number of levels (and the levels themselves) only decreased when `droplevels()` was used, but the number of rows decreased when `filter()` was used.
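The same behaviour can be reproduced on a tiny factor using base R only. This is an illustrative sketch: the toy vector below stands in for `gapminder$continent` and is not taken from the data.

```r
# Toy factor standing in for gapminder$continent (values are illustrative)
continent <- factor(c("Africa", "Asia", "Europe", "Oceania", "Oceania"))

# Subsetting removes the Oceania values but keeps Oceania as a level
kept <- continent[continent != "Oceania"]
length(kept)    # 3
nlevels(kept)   # 4

# droplevels() discards levels that no longer occur in the data
trimmed <- droplevels(kept)
length(trimmed)   # 3
levels(trimmed)   # "Africa" "Asia" "Europe"
```

For a single factor, `forcats::fct_drop()` offers the same clean-up within the tidyverse.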
Reorder the levels of `country` or `continent`. Use the `forcats` package to change the order of the factor levels, based on summarized information of one of the quantitative variables. Consider experimenting with a summary statistic beyond the most basic choice of the mean/median. Use the `forcats` package in the tidyverse for this, rather than the base R function `as.factor`.
Before releveling:
plot <- gapminder %>%
group_by(continent) %>%
summarise(var_lifeExp = var(lifeExp)) %>%
ggplot() +
geom_col(aes(continent,var_lifeExp)) +
theme_bw() +
ylab("Variance of\nLife Expectancy") +
xlab("Continent") +
coord_flip() +
ggtitle("Figure 1")
Using `fct_reorder()`:
fct_plot <- gapminder %>%
group_by(continent) %>%
summarise(var_lifeExp = var(lifeExp)) %>%
ggplot() +
geom_col(aes(fct_reorder(continent, var_lifeExp),var_lifeExp)) +
theme_bw() +
ylab("Variance of\nLife Expectancy") +
xlab("Continent") +
coord_flip() +
ggtitle("Figure 2")
Using `arrange()`:
arrange_plot <- gapminder %>%
group_by(continent) %>%
summarise(var_lifeExp = var(lifeExp)) %>%
arrange(desc(var_lifeExp), .by_group = TRUE) %>%
ggplot() +
geom_col(aes(continent,var_lifeExp)) +
theme_bw() +
ylab("Variance of\nLife Expectancy") +
xlab("Continent") +
coord_flip() +
ggtitle("Figure 3")
`arrange()` does not relevel the factor like `fct_reorder()` does, as shown by the “before releveling” plot and the “using `arrange()`” plot being the same (i.e. Figure 1 and Figure 3 are identical). `arrange()` does reorder the rows of the dataframe itself; it just does not relevel the factor.
Using `fct_reorder()`:
fct <- gapminder %>%
group_by(continent) %>%
summarise(var_lifeExp = var(lifeExp))
fct$continent <- fct_reorder(fct$continent, fct$var_lifeExp)
`fct_reorder()` does not change the row order of the dataframe; the `var_lifeExp` column appears in the same order as before:
continent | var_lifeExp |
---|---|
Africa | 83.72635 |
Americas | 87.33067 |
Asia | 140.76711 |
Europe | 29.51942 |
Oceania | 14.40666 |
The factor levels, however, are now ordered by `var_lifeExp`:
levels(fct$continent)
## [1] "Oceania" "Europe" "Africa" "Americas" "Asia"
Plotting the `fct` dataframe:
fct_df_plot <- ggplot(fct) +
geom_col(aes(continent,var_lifeExp)) +
theme_bw() +
ylab("Variance of\nLife Expectancy") +
xlab("Continent") +
coord_flip() +
ggtitle("Figure 4")
`fct_reorder()` relevels the factors.
Using `arrange()`:
arrange <- gapminder %>%
group_by(continent) %>%
summarise(var_lifeExp = var(lifeExp)) %>%
arrange(var_lifeExp, .by_group = TRUE)
`arrange()` does change the row order of the dataframe, so the `var_lifeExp` column is now sorted:
continent | var_lifeExp |
---|---|
Oceania | 14.40666 |
Europe | 29.51942 |
Africa | 83.72635 |
Americas | 87.33067 |
Asia | 140.76711 |
The factor levels, however, are not ordered by `var_lifeExp`:
levels(arrange$continent)
## [1] "Africa" "Americas" "Asia" "Europe" "Oceania"
Plotting the `arrange` dataframe:
arrange_df_plot <- ggplot(arrange) +
geom_col(aes(continent,var_lifeExp)) +
theme_bw() +
ylab("Variance of\nLife Expectancy") +
xlab("Continent") +
coord_flip() +
ggtitle("Figure 5")
`arrange()` does not relevel the factors.
`fct_reorder()` relevels the factor but does not change the row order of its dataframe, while `arrange()` changes the row order of its dataframe but does not relevel the factor.
Figure | Function | Description | Outcome | Equal to |
---|---|---|---|---|
1 | NA | This is the original plot with no attempt at releveling. | Factors are ordered alphabetically by default | Figure 3, 5 |
2 | fct_reorder() | This is the plot where fct_reorder() was used as an aes in ggplot() | Factors are ordered by var_lifeExp | Figure 4 |
3 | arrange() | This is the plot where arrange() was used on the dataframe before piping into ggplot() | Factors are still ordered alphabetically by default | Figure 1, 5 |
4 | fct_reorder() | This is the plot where fct_reorder() was used on the factor in the dataframe, and then that dataframe was plotted | Factors are ordered by var_lifeExp | Figure 2 |
5 | arrange() | This is the plot where arrange() was used on the dataframe and then that dataframe was plotted | Factors are still ordered alphabetically by default | Figure 1, 3 |
Here is a summary figure:
Looks like Oceania has the lowest variance in its life expectancy, while Asia has the highest variance.
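The releveling above first summarises the data and then reorders by the summary column. As an alternative sketch (not the approach used in this report), `fct_reorder()` can compute the summary itself through its `.fun` argument, applied to the raw `lifeExp` values within each continent:

```r
library(gapminder)
library(forcats)

# Reorder continent by the variance of lifeExp; .fun is applied per level
by_var <- fct_reorder(gapminder$continent, gapminder$lifeExp, .fun = var)

# Should match the level order obtained from the summarised dataframe:
# Oceania, Europe, Africa, Americas, Asia
levels(by_var)

# .desc = TRUE reverses the order (highest variance first)
by_var_desc <- fct_reorder(gapminder$continent, gapminder$lifeExp,
                           .fun = var, .desc = TRUE)
levels(by_var_desc)
```

This keeps the full dataset intact, which can be handy when the plot needs the raw observations (for example, boxplots) rather than one bar per continent.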
You are expected to create something new, probably by filtering or grouped summarization of your dataset (e.g. Singer, Gapminder, or another dataset), export it to disk, and then reload it using one of the packages above. You should use `here::here()` for reading in and writing out.
With the imported data, play around with factor levels and use factors to order your data with one of your factors (i.e. non-alphabetically). For the I/O method(s) you chose, comment on whether or not your newly created file survived the round trip of writing to file then reading back in.
Write the data out to `avg_gapminder.csv`:
library(here)
csv_out <- gapminder %>%
group_by(continent) %>%
summarize(avglifeExp = mean(lifeExp), avgGDP = mean(gdpPercap), avgPop = mean(pop))
write_csv(csv_out, here::here("hw05", "avg_gapminder.csv"), col_names = TRUE)
What `csv_out` looks like:
continent | avglifeExp | avgGDP | avgPop |
---|---|---|---|
Africa | 48.86533 | 2193.755 | 9916003 |
Americas | 64.65874 | 7136.110 | 24504795 |
Asia | 60.06490 | 7902.150 | 77038722 |
Europe | 71.90369 | 14469.476 | 17169765 |
Oceania | 74.32621 | 18621.609 | 8874672 |
Read the data back into `csv_in`:
csv_in <- read_csv(here::here("hw05", "avg_gapminder.csv"), col_names = TRUE)
## Parsed with column specification:
## cols(
## continent = col_character(),
## avglifeExp = col_double(),
## avgGDP = col_double(),
## avgPop = col_double()
## )
What `csv_in` looks like:
continent | avglifeExp | avgGDP | avgPop |
---|---|---|---|
Africa | 48.86533 | 2193.755 | 9916003 |
Americas | 64.65874 | 7136.110 | 24504795 |
Asia | 60.06490 | 7902.150 | 77038722 |
Europe | 71.90369 | 14469.476 | 17169765 |
Oceania | 74.32621 | 18621.609 | 8874672 |
In order to play with factors, `continent` must be a factor first. With the `read_csv()` call above, the resulting dataframe does not actually contain factors! The `continent` column has been imported as class `character` instead of `factor`:
str(csv_in)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 5 obs. of 4 variables:
## $ continent : chr "Africa" "Americas" "Asia" "Europe" ...
## $ avglifeExp: num 48.9 64.7 60.1 71.9 74.3
## $ avgGDP : num 2194 7136 7902 14469 18622
## $ avgPop : num 9916003 24504795 77038722 17169765 8874672
## - attr(*, "spec")=
## .. cols(
## .. continent = col_character(),
## .. avglifeExp = col_double(),
## .. avgGDP = col_double(),
## .. avgPop = col_double()
## .. )
The `continent` column must therefore be converted so that it is a factor. There are two ways to do this:
Adding additional parameters to `read_csv()`:
csv_in <- read_csv(
here::here("hw05", "avg_gapminder.csv"),
col_names = TRUE,
col_types = cols(
continent = col_factor(),
avglifeExp = col_double(),
avgGDP = col_double(),
avgPop = col_double()
)
)
levels(csv_in$continent)
## [1] "Africa" "Americas" "Asia" "Europe" "Oceania"
Using `factor()`:
csv_in$continent <- factor(csv_in$continent)
levels(csv_in$continent)
## [1] "Africa" "Americas" "Asia" "Europe" "Oceania"
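Both approaches above let the level order be inferred from the data. As a hedged sketch, `col_factor()` also accepts a `levels` argument, so a specific order can be fixed at import time. The temporary file and its three rows below are made up for illustration and are not the `hw05` file.

```r
library(readr)

# Small stand-in file written to a temporary location (not the hw05 folder)
tmp <- tempfile(fileext = ".csv")
writeLines(c("continent,avgPop",
             "Africa,9916003",
             "Oceania,8874672",
             "Asia,77038722"), tmp)

# levels = fixes the level order at import instead of inferring it from the data
df <- read_csv(
  tmp,
  col_types = cols(
    continent = col_factor(levels = c("Oceania", "Africa", "Asia")),
    avgPop = col_double()
  )
)
levels(df$continent)   # "Oceania" "Africa" "Asia"
```

Fixing the levels up front also makes the import stricter, since values outside the listed levels are flagged as parsing problems rather than silently becoming new levels.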
Releveling (by `avgPop`):
csv_in$continent <- fct_reorder(csv_in$continent, csv_in$avgPop)
levels(csv_in$continent)
## [1] "Oceania" "Africa" "Europe" "Americas" "Asia"
Check that releveling actually reordered the factors in order of increasing `avgPop`:
releveled <- csv_in %>%
arrange(avgPop)
continent | avglifeExp | avgGDP | avgPop |
---|---|---|---|
Oceania | 74.32621 | 18621.609 | 8874672 |
Africa | 48.86533 | 2193.755 | 9916003 |
Europe | 71.90369 | 14469.476 | 17169765 |
Americas | 64.65874 | 7136.110 | 24504795 |
Asia | 60.06490 | 7902.150 | 77038722 |
Since the order of `continent` in the table above matches the order of `levels(csv_in$continent)`, releveling was successful!
Since `csv_in` and `csv_out` contain identical values (with or without factor releveling), the newly created file `avg_gapminder.csv` DID survive the round trip of writing to file then reading back in. The one thing that did not survive is the factor class of `continent`, which had to be restored after import as shown above.
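For comparison, here is a minimal sketch of a round trip that does keep the factor and its level order, using base R's `saveRDS()` and `readRDS()` instead of CSV. The toy dataframe and the temporary file are assumptions for illustration; in this report the path would come from `here::here()`.

```r
# Toy summary table with a factor whose levels are deliberately non-alphabetical
df <- data.frame(
  continent = factor(c("Africa", "Asia", "Oceania"),
                     levels = c("Oceania", "Africa", "Asia")),
  avgPop = c(9916003, 77038722, 8874672)
)

# Temporary file for illustration only
rds_path <- tempfile(fileext = ".rds")
saveRDS(df, rds_path)
back <- readRDS(rds_path)

class(back$continent)    # "factor"
levels(back$continent)   # "Oceania" "Africa" "Asia"
```

The trade-off is that an `.rds` file is a binary R format, whereas a CSV stays readable in any tool but stores plain text only.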
Go back through your previous assignments and class participation activities and find figures you created prior to the last week of the course. Recreate at least one figure in light of something you learned in the recent class meetings about visualization design and color.
Before plot (I was given feedback from one of the TAs that it was too busy/crowded):
before_plot <- gapminder %>%
mutate(million=pop/(10**6)) %>%
ggplot(aes(x=million,y=lifeExp)) +
geom_point(aes(shape = continent, colour = year)) +
ggtitle("Life Expectancy vs Population") +
xlab("Population (millions)") +
ylab("Life Expectancy (years)")
After plot:
data <- gapminder %>%
mutate(pop=pop/(10**6)) %>%
filter(country == "Canada" | country == "United States" | country == "Mexico") %>%
select(c(-gdpPercap)) %>%
pivot_longer(cols = c(lifeExp,pop),
names_to = "key",
values_to = "value")
data$country <- fct_relevel(data$country,"Canada", "United States", "Mexico") %>%
droplevels()
after_plot <- data %>%
ggplot(aes(year,value)) +
geom_line(aes(colour = country)) +
geom_point(aes(colour = country)) +
facet_wrap(~ key,
ncol = 1,
scales = "free_y",
labeller = labeller(key = c(lifeExp = "Life Expectancy (years)",
pop = "Population (millions)"))) +
ggtitle("Trends in North America") +
xlab("year") +
ylab("") +
theme_bw() +
theme(text = element_text(size = 14),
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank(),
legend.position="bottom"
)
Combine the two side by side:
combined <- grid.arrange(before_plot,after_plot, ncol=2)
## TableGrob (1 x 2) "arrange": 2 grobs
## z cells name grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]
Instead of `lifeExp` vs `pop`, the plot now shows `lifeExp` vs `year` and `pop` vs `year`.
Use `ggsave()` to explicitly save a plot to file. Include the exported plot as part of your repository and assignment. Then, use Markdown image syntax to load and embed that file into your report. You can play around with various options of `ggsave()`, such as width, height, resolution or text scaling, and with passing the plot `p` explicitly via `ggsave(..., plot = p)`. Show a situation in which this actually matters.
By default, `ggsave()` saves the most recent plot made using `ggplot()`. With the code below, `ggsave()` will save `after_plot` (the most recent plot) to `combined.png`. Because `combined` (the plot we want to save) was assembled with `grid.arrange()` rather than plotted directly using `ggplot()`, it is not the plot `ggsave()` picks by default, so specifying `plot =` matters in this case.
ggsave(here::here("hw05","combined.png"), width = 16, height = 9, dpi = 300, device = "png")
To save the `combined` plot, `plot = combined` must be specified explicitly:
ggsave(here::here("hw05","combined.png"), plot = combined, width = 16, height = 9, dpi = 300, device = "png")
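The same point can be shown in isolation with two ordinary ggplot objects. This is a small sketch using the built-in `mtcars` data and a temporary output file, both chosen only for illustration:

```r
library(ggplot2)

p1 <- ggplot(mtcars, aes(wt, mpg)) + geom_point()
p2 <- ggplot(mtcars, aes(hp, mpg)) + geom_point()

# Temporary output file for illustration only
out <- tempfile(fileext = ".png")

# plot = p1 saves the first plot even though p2 was created more recently
ggsave(out, plot = p1, width = 4, height = 3, dpi = 150)
file.exists(out)
```

Naming the plot explicitly keeps the saved file independent of whatever happened to be built or printed last in the session.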
Using Markdown image syntax to load and embed that file into your report: