The main value of the here::here package lies in reproducibility. It lets users run their scripts in a platform-agnostic manner and share them with others. Its advantages are clearest when compared with the commonly discouraged alternative, setwd(). setwd() relies on paths that the user specifies, which are specific to that user, computer, and point in time. here::here(), on the other hand, uses a fixed order of heuristics to find the project root directory, so the correct path is found on whichever computer the script is run.
In summary, here::here:
Allows better collaboration: others can view and run your scripts on different platforms without relying on user-specific paths.
Avoids the risk that comes with setwd() and an absolute path: the script may fail to run on another computer, at a different time, on a different OS, or outside of RStudio.
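As a minimal sketch of the point above, the chunk below builds a file path from its components instead of hard-coding an absolute path. It assumes the script is run from inside a project folder that here can detect (for example, one containing an .Rproj file or a .git directory); the folder and file names are the ones used later in this report.

```r
library(here)

# here() with no arguments reports the project root that was detected.
project_root <- here()

# Build a path from its components; no user-specific prefix is hard-coded.
csv_path <- here("HW05", "Gapminder_WT_GDP_Mean_Life.csv")
csv_path
```

The same call should resolve correctly on another machine as long as the project folder keeps the same internal layout.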
We will be using the gapminder data set to explore factor management:
Let's look at the factors in gapminder again before dropping Oceania.
gapminder$continent %>%
levels()
## [1] "Africa" "Americas" "Asia" "Europe" "Oceania"
gapminder$continent %>%
nlevels()
## [1] 5
gapminder %>%
str()
## Classes 'tbl_df', 'tbl' and 'data.frame': 1704 obs. of 6 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ pop : int 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
## $ gdpPercap: num 779 821 853 836 740 ...
We see there are 5 levels in continent and 142 levels in country. There are 1704 rows.
gap_no_ocean <- gapminder %>%
filter(!(continent=="Oceania"))
str(gap_no_ocean)
## Classes 'tbl_df', 'tbl' and 'data.frame': 1680 obs. of 6 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ pop : int 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
## $ gdpPercap: num 779 821 853 836 740 ...
## Before dropping the unused levels we still see 142 levels in country and 5 levels in continent. That's not right! We have a lost continent, so we need to drop some levels.
gap_no_ocean <- gap_no_ocean %>%
droplevels()
str(gap_no_ocean)
## Classes 'tbl_df', 'tbl' and 'data.frame': 1680 obs. of 6 variables:
## $ country : Factor w/ 140 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ continent: Factor w/ 4 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ pop : int 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
## $ gdpPercap: num 779 821 853 836 740 ...
## Second time after dropping
See the comment within the code for the state before dropping levels.
As you can see, after dropping the unused levels the number of levels in continent fell to 4 and the number of levels in country fell to 140. The data set has 1680 rows both before and after dropping levels, which is fewer than the original 1704.
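The same behaviour can be reproduced on a small factor using base R only. This is a toy sketch with made-up values, not gapminder data: subsetting removes the observations but keeps the level, and droplevels() removes the level itself.

```r
# A toy factor with three levels (sorted alphabetically by default).
sizes <- factor(c("small", "medium", "large", "small"))

# Subsetting removes the "large" observation but keeps its level.
kept <- sizes[sizes != "large"]
nlevels(kept)              # still 3
table(kept)                # "large" is listed with a count of 0

# droplevels() discards levels that no longer occur in the data.
trimmed <- droplevels(kept)
levels(trimmed)            # "medium" "small"
```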
Let's reorder the continent factor from gapminder based on maximum GDP per capita.
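gapminder %>%
## Summarize first: max(gdpPercap) inside aes() would return the global maximum for every row
group_by(continent) %>%
summarize(maxgdp=max(gdpPercap)) %>%
ggplot() +
geom_col(aes(continent,maxgdp))+
coord_flip()+
theme_bw() +
scale_y_continuous(labels=scales::comma)+
ylab("GDP Per Capita") + xlab("Continent") +
ggtitle("Before Reordering")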
## Use a boxplot to visualize the spread and the maximum GDP values we expect to see
gapminder %>%
ggplot(aes(continent,gdpPercap))+
geom_boxplot()+
theme_bw() +
scale_y_log10(labels=scales::comma)+
ylab("log GDP Per Capita") + xlab("Continent") +
ggtitle("Before Reordering")
Based on the boxplot, we predict that Asia will have the highest maximum GDP (uppermost whisker) and Africa the lowest (its uppermost whisker is the smallest).
Now let’s rearrange the data by maximum GDP per continent, plot the rearranged table directly, and see whether the plot follows the same order as the table.
## Let's find the maximum GDP per continent and arrange by it
arranged_gdp <- gapminder %>%
group_by(continent) %>%
summarize(maxgdp=max(gdpPercap)) %>%
arrange(desc(maxgdp))
arranged_gdp %>% knitr::kable()
continent | maxgdp |
---|---|
Asia | 113523.13 |
Europe | 49357.19 |
Americas | 42951.65 |
Oceania | 34435.37 |
Africa | 21951.21 |
arranged_gdp %>%
ggplot() +
geom_col(aes(continent,maxgdp))+
coord_flip()+
theme_bw() +
ylab( "GDP per Capita") + xlab("Continent") +
ggtitle("Using Arrange")
The continents are arranged correctly by maximum GDP in the table but not on the plot, where they appear in alphabetical order. Using arrange therefore neither affects the plotting order nor reorders the factor levels.
What about if we use fct_reorder?
## plot_order_max_gdp is in the same order as the grouped and arranged table, aka arranged_gdp
plot_order_max_gdp <- arranged_gdp %>%
ggplot() +
geom_col(aes(y=maxgdp, x=fct_reorder(continent,maxgdp)))+
coord_flip()+
theme_bw() +
ylab("GDP per Capita") + xlab("Continent") +
ggtitle("Ordered based on Maximum GDP per Continent")
plot_order_max_gdp
## https://ggplot2.tidyverse.org/reference/geom_bar.html Ref for geom_col
## https://stat545.com/factors-boss.html --> learned how to plot bar graphs
## https://cmdlinetips.com/2019/02/how-to-reorder-a-boxplot-in-r/ how to reorder a boxplot
The plot is ordered correctly by maximum GDP when fct_reorder is used. Interesting notes about the plot: maximum GDP per capita is highest in Asia and lowest in Africa.
## Before Reordering
levels(arranged_gdp$continent)
## [1] "Africa" "Americas" "Asia" "Europe" "Oceania"
## Using Factor Reorder
arranged_gdp$continent <- fct_reorder(arranged_gdp$continent,arranged_gdp$maxgdp)
levels(arranged_gdp$continent)
## [1] "Africa" "Oceania" "Americas" "Europe" "Asia"
As you can see, arrange only changes the order in which rows are shown in a table, while fct_reorder actually changes the order of the levels in the factor. This is visible in both the plot and the output of levels(), where the levels now run from the smallest to the greatest maximum GDP (ascending order). The original arranged_gdp levels, on the other hand, are in alphabetical order.
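The contrast between the two functions can be seen on a three-row toy table. The data frame and its column names below are made up for illustration.

```r
library(dplyr)
library(forcats)

toy <- tibble(
  group = factor(c("a", "b", "c")),
  value = c(3, 1, 2)
)

# arrange() sorts the rows; the factor levels stay alphabetical.
sorted_rows <- arrange(toy, desc(value))
levels(sorted_rows$group)      # "a" "b" "c"

# fct_reorder() rewrites the level order (ascending by value);
# the rows stay where they were.
releveled <- mutate(toy, group = fct_reorder(group, value))
levels(releveled$group)        # "b" "c" "a"
as.character(releveled$group)  # "a" "b" "c"
```

fct_reorder also takes a .desc argument, so fct_reorder(group, value, .desc = TRUE) would give the levels in descending order instead.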
Let’s create a data set that stores the continent and the weighted mean of gdp per Capita with population, weighted standard deviation, and mean and sd of life Expectancy.
gap_wtgdp_meanlife <- gapminder %>%
group_by(continent) %>%
summarize(wt.mean(gdpPercap,pop),wt.sd(gdpPercap,pop),mean(lifeExp),sd(lifeExp))
gap_wtgdp_meanlife
## # A tibble: 5 x 5
## continent `wt.mean(gdpPer… `wt.sd(gdpPerca… `mean(lifeExp)` `sd(lifeExp)`
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 Africa 2108. 2189. 48.9 9.15
## 2 Americas 15477. 12110. 64.7 9.35
## 3 Asia 2950. 5240. 60.1 11.9
## 4 Europe 15693. 8903. 71.9 5.43
## 5 Oceania 21205. 7420. 74.3 3.80
## Used SDMTools Package
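For readers without SDMTools, here is a minimal base R sketch of a population-weighted mean. It assumes wt.mean() computes the usual weighted average, sum(w * x) / sum(w); the numbers are made up and are not gapminder values.

```r
# Toy values and weights (hypothetical numbers, not gapminder data).
gdp <- c(1000, 3000)
pop <- c(1, 3)

# Base R weighted mean: sum(w * x) / sum(w).
by_function <- weighted.mean(gdp, pop)
by_hand <- sum(pop * gdp) / sum(pop)
by_function   # 2500
by_hand       # 2500
```

Because the larger population carries more weight, the result sits closer to 3000 than the unweighted mean of 2000 would.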
write_csv(gap_wtgdp_meanlife,here("HW05","Gapminder_WT_GDP_Mean_Life.csv"))
Gapminder_read_in <- read_csv(here("HW05","Gapminder_WT_GDP_Mean_Life.csv"))
## Parsed with column specification:
## cols(
## continent = col_character(),
## `wt.mean(gdpPercap, pop)` = col_double(),
## `wt.sd(gdpPercap, pop)` = col_double(),
## `mean(lifeExp)` = col_double(),
## `sd(lifeExp)` = col_double()
## )
After reading the file back in with the default read_csv settings, the continent column has changed from a factor to class character.
## Let's convert continent back into a factor first
Gapminder_read_in$continent <- as_factor(Gapminder_read_in$continent)
class(Gapminder_read_in$continent)
## [1] "factor"
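An alternative to converting after the fact is to declare the column type while reading. The sketch below uses a temporary file and a made-up two-row table rather than the homework file; col_factor() with no levels given takes the levels in their order of appearance.

```r
library(readr)

# Write a small table with a factor column to a temporary file.
toy <- data.frame(
  continent = factor(c("Asia", "Africa")),
  value = c(1.5, 2.5)
)
tmp <- tempfile(fileext = ".csv")
write_csv(toy, tmp)

# Default parsing would return continent as character; col_factor()
# asks readr to parse it as a factor instead.
read_back <- read_csv(
  tmp,
  col_types = cols(continent = col_factor(), value = col_double())
)
class(read_back$continent)   # "factor"
levels(read_back$continent)  # "Asia" "Africa" (order of appearance)
```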
## Let's reorder our factors, first by the weighted gdp Per Capita
## First let's check whether fct_reorder alone is enough for the table:
Gapminder_read_in$continent <- fct_reorder(Gapminder_read_in$continent,Gapminder_read_in$`wt.mean(gdpPercap, pop)`)
Gapminder_read_in
## # A tibble: 5 x 5
## continent `wt.mean(gdpPer… `wt.sd(gdpPerca… `mean(lifeExp)` `sd(lifeExp)`
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 Africa 2108. 2189. 48.9 9.15
## 2 Americas 15477. 12110. 64.7 9.35
## 3 Asia 2950. 5240. 60.1 11.9
## 4 Europe 15693. 8903. 71.9 5.43
## 5 Oceania 21205. 7420. 74.3 3.80
## No
## Have to use arrange
gapreadin_wt<- Gapminder_read_in %>%
arrange(desc(`wt.mean(gdpPercap, pop)`))
gapreadin_wt
## # A tibble: 5 x 5
## continent `wt.mean(gdpPer… `wt.sd(gdpPerca… `mean(lifeExp)` `sd(lifeExp)`
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 Oceania 21205. 7420. 74.3 3.80
## 2 Europe 15693. 8903. 71.9 5.43
## 3 Americas 15477. 12110. 64.7 9.35
## 4 Asia 2950. 5240. 60.1 11.9
## 5 Africa 2108. 2189. 48.9 9.15
gapreadin_wt %>%
ggplot() +
geom_col(aes(y=`wt.mean(gdpPercap, pop)`, x=fct_reorder(continent,`wt.mean(gdpPercap, pop)`)))+
coord_flip()+
theme_bw() +
ylab("Weighted GDP per Capita") + xlab("Continent") +
ggtitle("Ordered based on GDP per Capita Weighted by Population")
This shows that fct_reorder does not change the order of the rows in a table; arrange is what you have to use for that. Interestingly, Oceania has the highest weighted GDP per capita, whereas Asia and Africa have the lowest.
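The two steps also combine: once the levels have been reordered, arrange on the factor column sorts rows by level order rather than alphabetically. A toy sketch with made-up names and values:

```r
library(dplyr)
library(forcats)

toy <- tibble(
  group = factor(c("a", "b", "c")),
  value = c(3, 1, 2)
)

# Reordering the levels alone leaves the printed row order unchanged.
releveled <- mutate(toy, group = fct_reorder(group, value))
as.character(releveled$group)    # "a" "b" "c"

# arrange() on the factor then sorts rows by level order, not alphabetically.
by_levels <- arrange(releveled, group)
as.character(by_levels$group)    # "b" "c" "a"
```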
## sd gdp per capita
gapreadin_sd_wtgdp<- Gapminder_read_in %>%
arrange(desc(`wt.sd(gdpPercap, pop)`))
gapreadin_sd_wtgdp %>%
ggplot() +
geom_col(aes(y=`wt.sd(gdpPercap, pop)`, x=fct_reorder(continent,`wt.sd(gdpPercap, pop)`)))+
coord_flip()+
theme_bw() +
ylab("Weighted Standard Deviation of GDP per Capita") + xlab("Continent") +
ggtitle("Ordered based on Population-Weighted Standard Deviation of GDP per Capita")
## sd of Life Expectancy
gapreadin_sd_lifeExp<- Gapminder_read_in %>%
arrange(desc(`sd(lifeExp)`))
gapreadin_sd_lifeExp %>%
ggplot() +
geom_col(aes(y=`sd(lifeExp)`, x=fct_reorder(continent,`sd(lifeExp)`)))+
coord_flip()+
theme_bw() +
ylab("Standard Deviation of Life Expectancy") + xlab("Continent") +
ggtitle("Ordered based on Standard Deviation of Life Expectancy By Continent")
It is interesting to note how the order changes depending on which standard deviation is used.
In assignment 3, I originally created side-by-side density plots with facet_wrap to compare the spread of GDP per capita. This made it harder to compare the spreads of different continents with each other. The plot also had no title. A peer reviewer recommended a ridge plot, which I have created below with geom_density_ridges from ggridges.
Let’s view the original
## Original
density_gdpPercap <- gapminder%>%
ggplot(aes(x=log(gdpPercap))) +
geom_density() +
facet_wrap(. ~continent)
density_gdpPercap
Let’s view an updated ridge plot and compare both types of plots.
ridge_gdpPercap <- gapminder%>%
ggplot(aes(y=continent,x=gdpPercap)) +
scale_x_log10(labels=scales::comma)+
ggridges::geom_density_ridges(aes(y=continent))+
xlab("log of GDP per Capita")+
theme_bw()
ridge_gdpPercap+
ggtitle("Spread of GDP per Capita of Different Continents")
## Picking joint bandwidth of 0.0962
## Plotting side by side
density_plots <- ggarrange(density_gdpPercap,ridge_gdpPercap,labels="AUTO") %>%
annotate_figure(top="Spread of GDP per Capita of Different Continents")
## Picking joint bandwidth of 0.0962
density_plots
## Learned about ggarrange from: http://www.sthda.com/english/wiki/print.php?id=177
## Common title learned from: https://stackoverflow.com/questions/49825971/add-a-common-legend
The ridge plot (B) makes it much easier to compare the spreads of different continents on one plot, and makes trends easier to see than in the original (A).
The difference:
I didn’t add a title to the original plot objects before plotting side by side because the objects kept getting cut off.
Also in assignment 3, I created grouped bar charts to look at the number of countries below the global median life expectancy. Jenny Bryan notes that grouped bar charts make it harder to see trends between items that are not close to each other, and she does not recommend them on her STAT 545 page.
Source: https://stat545.com/effective-graphs.html
The code below was used to generate the initial graph. Storing its result as an object saves typing and makes the later code cleaner:
gap_median <- gapminder %>%
group_by(year) %>%
mutate(median_lifeExp=median(lifeExp)) %>%
ungroup(year) %>%
mutate(less_than_median=lifeExp<median_lifeExp) %>%
filter(less_than_median) %>%
group_by(continent,year,less_than_median) %>%
summarize(n_less_than_median=sum(less_than_median))
grouped_bar <- gap_median %>%
ggplot(aes(x=year, y=n_less_than_median,group=continent))+
geom_bar(aes(fill=continent),position="dodge",stat="identity")+
ylab("Number less than Median")
grouped_bar
In the case above, it is indeed difficult to compare the continents with each other over time.
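The counting step behind both bar charts can be sketched on a toy table. The rows below are made up, and count() is used here as a shorter stand-in for the group_by and summarize steps; it is an alternative, not the code that produced the plots.

```r
library(dplyr)

toy <- tibble(
  continent = c("Africa", "Africa", "Asia", "Europe",
                "Africa", "Asia", "Asia", "Europe"),
  year      = c(1952, 1952, 1952, 1952, 1957, 1957, 1957, 1957),
  lifeExp   = c(30, 35, 45, 65, 33, 38, 50, 68)
)

below_median <- toy %>%
  group_by(year) %>%
  # The median is computed within each year.
  mutate(median_lifeExp = median(lifeExp)) %>%
  ungroup() %>%
  filter(lifeExp < median_lifeExp) %>%
  # count() tallies the remaining rows per continent and year.
  count(continent, year, name = "n_less_than_median")
below_median
```

Continents with no country below the median in a given year simply do not appear in the result, which is why a continent can be missing from the plots entirely.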
Claus O. Wilke, in his book Fundamentals of Data Visualization, also mentions that grouped bar charts are confusing and cautions against them. He recommends facet-wrapped bar charts. Source: https://serialmentor.com/dataviz/visualizing-amounts.html#visualizing-amounts
I followed his recommendation and created a new facet wrap bar chart and then put both bar charts side by side.
facet_wrap_bar <- gap_median %>%
ggplot(aes(x=year, y=n_less_than_median,group=continent))+
geom_col(fill="light blue")+
ylab("Number of Countries")+
facet_wrap(.~continent)+
theme_bw()
facet_wrap_bar +
ggtitle("Number of Countries with Less than Median Life Expectancy vs Time")
bar_less_med <- ggarrange(grouped_bar,facet_wrap_bar,labels = c("C","D")) %>%
annotate_figure(top="Number of Countries with Less than Median Life Expectancy vs Time")
bar_less_med
Observations: The facet-wrapped bar chart (D) makes it much easier to compare, over time, the number of countries in each continent with a life expectancy below the median. It also makes trends easier to see than in the original (C); for example, Africa has the greatest number of countries below the median over time. Oceania is absent from both graphs, which is more obvious in D, meaning it has never had a country below the median life expectancy. Other trends are also visible.
The difference:
Let’s save all 4 plots from 4.0 above as one plot.
four_plots <- ggarrange(density_plots,bar_less_med,nrow=2,ncol=1)
ggsave(here("HW05","Combined_Gapminder_Plots.png"), four_plots,width=10,height=10)
Here are the 4 plots I created in part 4.