(Refer back to the Advanced Data Manipulation lesson).

Key Concepts

  • dplyr verbs
  • the pipe %>%
  • the tbl_df
  • variable creation
  • multiple conditions
  • properties of grouped data
  • aggregation
  • summary functions
  • window functions

Getting Started

We’re going to work with a different dataset for the homework here. It’s a cleaned-up excerpt from the Gapminder data. Download gapminder.csv from http://bioconnector.org/workshops/data/gapminder.csv and save it in a data/ subfolder of the project directory, where you can access it easily from R.

Load the dplyr and readr packages, and read the gapminder data into R using the read_csv() function (n.b. read_csv() is not the same as read.csv()). Assign the data to an object called gm.

In your submitted homework assignment, I would prefer you use the read_csv() function to read the data directly from the web (see below). This way I can run your R code without worrying about whether I have the data/ directory or not.

library(dplyr)
library(readr)

# Preferably: read data from web
gm <- read_csv("http://bioconnector.org/workshops/data/gapminder.csv")

# Alternatively read from file:
# gm <- read_csv("data/gapminder.csv")

# Display the data
gm
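
To make the note about read_csv() versus read.csv() concrete, here is a minimal sketch on a tiny made-up CSV file (the file contents and the object names toy_readr and toy_base are hypothetical, not part of the assignment). It only illustrates that readr's read_csv() returns a tibble, while base R's read.csv() returns a plain data frame.

```r
library(readr)

# Write a tiny, made-up CSV to a temporary file (hypothetical data, not gapminder)
tmp <- tempfile(fileext = ".csv")
writeLines(c("country,year,pop", "Aland,1992,100", "Borduria,1997,250"), tmp)

toy_readr <- read_csv(tmp)  # readr: returns a tibble
toy_base  <- read.csv(tmp)  # base R: returns a plain data.frame

# Compare what kind of object each function produced
class(toy_readr)
class(toy_base)

# read_csv() keeps text columns as character vectors
class(toy_readr$country)
```

Printing toy_readr shows the column types under the header, the same layout as the expected outputs in the problem set.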

Problem set

Use dplyr functions to address the following questions:

  1. How many unique countries are represented per continent?
## # A tibble: 5 x 2
##   continent     n
##       <chr> <int>
## 1    Africa    52
## 2  Americas    25
## 3      Asia    33
## 4    Europe    30
## 5   Oceania     2
  2. Which European nation had the lowest GDP per capita in 1997?
## # A tibble: 1 x 6
##   country continent  year lifeExp     pop gdpPercap
##     <chr>     <chr> <int>   <dbl>   <int>     <dbl>
## 1 Albania    Europe  1997      73 3428038      3193
  3. According to the data available, what was the average life expectancy across each continent in the 1980s?
## # A tibble: 5 x 2
##   continent mean.lifeExp
##       <chr>        <dbl>
## 1    Africa         52.5
## 2  Americas         67.2
## 3      Asia         63.7
## 4    Europe         73.2
## 5   Oceania         74.8
  4. What 5 countries have the highest total GDP over all years combined?
## # A tibble: 5 x 2
##          country Total.GDP
##            <chr>     <dbl>
## 1  United States  7.68e+13
## 2          Japan  2.54e+13
## 3          China  2.04e+13
## 4        Germany  1.95e+13
## 5 United Kingdom  1.33e+13
  5. What countries and years had life expectancies of at least 80 years? N.b. only output the columns of interest: country, life expectancy and year (in that order).
## # A tibble: 22 x 3
##             country lifeExp  year
##               <chr>   <dbl> <int>
##  1        Australia    80.4  2002
##  2        Australia    81.2  2007
##  3           Canada    80.7  2007
##  4           France    80.7  2007
##  5 Hong Kong, China    80.0  1997
##  6 Hong Kong, China    81.5  2002
##  7 Hong Kong, China    82.2  2007
##  8          Iceland    80.5  2002
##  9          Iceland    81.8  2007
## 10           Israel    80.7  2007
## # ... with 12 more rows
  6. What 10 countries have the strongest correlation (in either direction) between life expectancy and per capita GDP?
## # A tibble: 10 x 2
##           country     r
##             <chr> <dbl>
##  1         France 0.996
##  2        Austria 0.993
##  3        Belgium 0.993
##  4         Norway 0.992
##  5           Oman 0.991
##  6 United Kingdom 0.990
##  7          Italy 0.990
##  8         Israel 0.988
##  9        Denmark 0.987
## 10      Australia 0.986
  7. Which combinations of continent (besides Asia) and year have the highest average population across all countries? N.b. your output should include all results, sorted from highest to lowest average population. With what you already know, this one may stump you. See this Q&A for how to ungroup before arranging. This also behaves differently in more recent versions of dplyr.
## # A tibble: 48 x 3
##    continent  year mean.pop
##        <chr> <int>    <dbl>
##  1  Americas  2007 35954847
##  2  Americas  2002 33990910
##  3  Americas  1997 31876016
##  4  Americas  1992 29570964
##  5  Americas  1987 27310159
##  6  Americas  1982 25211637
##  7  Americas  1977 23122708
##  8  Americas  1972 21175368
##  9    Europe  2007 19536618
## 10    Europe  2002 19274129
## # ... with 38 more rows
  8. Which three countries have had the most consistent population estimates (i.e. lowest standard deviation) across the years of available data?
## # A tibble: 3 x 2
##                 country sd.pop
##                   <chr>  <dbl>
## 1 Sao Tome and Principe  45906
## 2               Iceland  48542
## 3            Montenegro  99738
  9. Subset gm to only include observations from 1992 and store the results as gm1992. What kind of object is this?
## [1] "tbl_df"     "tbl"        "data.frame"
  10. Bonus! Which observations indicate that the population of a country has decreased from the previous year and the life expectancy has increased from the previous year? See the vignette on window functions.
## Source: local data frame [36 x 6]
## Groups: country [22]
## 
## # A tibble: 36 x 6
##                   country continent  year lifeExp      pop gdpPercap
##                     <chr>     <chr> <int>   <dbl>    <int>     <dbl>
##  1            Afghanistan      Asia  1982    39.9 12881816       978
##  2 Bosnia and Herzegovina    Europe  1992    72.2  4256013      2547
##  3 Bosnia and Herzegovina    Europe  1997    73.2  3607000      4766
##  4               Bulgaria    Europe  2002    72.1  7661799      7697
##  5               Bulgaria    Europe  2007    73.0  7322858     10681
##  6                Croatia    Europe  1997    73.7  4444595      9876
##  7         Czech Republic    Europe  1997    74.0 10300707     16049
##  8         Czech Republic    Europe  2002    75.5 10256295     17596
##  9         Czech Republic    Europe  2007    76.5 10228744     22833
## 10      Equatorial Guinea    Africa  1977    42.0   192675       959
## # ... with 26 more rows
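
The hint about ungrouping before arranging can be sketched on a small made-up data frame. This is an illustration of the pattern only, not a solution: the object names toy and toy_means and the columns region, year and value are hypothetical. After summarize() on data grouped by two variables, the result can still be grouped by the first one, so the sketch calls ungroup() explicitly before sorting. As far as I know, recent dplyr versions ignore grouping in arrange() unless .by_group = TRUE is set, in which case the ungroup() is harmless.

```r
library(dplyr)

# Hypothetical toy data, not the gapminder data
toy <- data.frame(
  region = c("North", "North", "North", "North", "South", "South", "South", "South"),
  year   = c(2000, 2000, 2005, 2005, 2000, 2000, 2005, 2005),
  value  = c(10, 20, 30, 50, 100, 200, 1, 3),
  stringsAsFactors = FALSE
)

toy_means <- toy %>%
  group_by(region, year) %>%
  summarize(mean_value = mean(value)) %>%  # result may still be grouped by region
  ungroup() %>%                            # drop any remaining grouping
  arrange(desc(mean_value))                # sort across all rows, largest first

toy_means
```

With the grouping removed, the rows are ordered by mean_value across all regions rather than within each region.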

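For the bonus question, the window function lag() is one way to compare each row with the one before it. The sketch below uses made-up data with hypothetical names (toy, toy_flagged, unit, size, score) and is only meant to show the mechanics: within each group, lag(x) returns the value of x from the previous row, and NA for the first row of the group.

```r
library(dplyr)

# Hypothetical toy data, not the gapminder data
toy <- data.frame(
  unit  = c("a", "a", "a", "b", "b", "b"),
  year  = c(1990, 1995, 2000, 1990, 1995, 2000),
  size  = c(10, 9, 12, 5, 6, 4),
  score = c(1.0, 1.5, 1.2, 2.0, 2.5, 3.0),
  stringsAsFactors = FALSE
)

toy_flagged <- toy %>%
  arrange(unit, year) %>%  # make sure rows are in time order within each unit
  group_by(unit) %>%       # so lag() never looks across units
  filter(size < lag(size), score > lag(score)) %>%  # compare with previous row
  ungroup()

toy_flagged
```

Because filter() drops rows where a condition evaluates to NA, the first observation of each group is excluded automatically.
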
Source: https://raw.githubusercontent.com/4va/biodatasci/master/r-dplyr-homework.Rmd