10. Manipulate data with dplyr

When working on a real project, data will seldom (if ever!) arrive in exactly the format you would like to have it in in order to analyse it. We need to manipulate and transform data and just as we have a grammar for generating graphics (the layered grammar of graphics in ggplot), we also have a syntax for data transformation.

dplyr is a package that contains useful functions for transforming and manipulating data frames. You can think of these functions as verbs, that do something to the data. All of the dplyr verbs (or functions), and in fact pretty much everything in the tidyverse, works in the following fashion:

The first argument is a data frame
Subsequent arguments describe what to do with the data frame
The result is a new data frame

10.1 The `pipe` operator, or `%>%`

The pipe operator, this strange %>% thing, takes the value to the left of it and passes it through to the thing to the right of it. Let us create a couple of lists and a simple function to see an example

# A list (or vector) of multiple values too:
my_first_list <- c(1, 2, 3, 5, 8, 13, 21, 34, 55, 89)
my_second_list <- c(1, 1, 2, 3, 5, 8, 13, 21, 34, 55)

# Define a function that takes X and adds 100
my_function <- function(x) {
  new_x <- x + 100
  return(new_x)
}

Functions work on single values and on lists (or vectors):

# call my_function with x=14 as an argument
my_function(14)

## [1] 114

# call my_function with x=my_first_list as an argument; this is a 
# vectorised operation, as it will add 100 to each value in my_first_list
my_function(my_first_list)

##  [1] 101 102 103 105 108 113 121 134 155 189

# call my_function with my_first_list+my_second_list as argument; this is a 
# vectorised operation, as it will first add my_first_list+my_second_list 
# and then add 100 to each value 
my_function(my_first_list+my_second_list)

##  [1] 102 103 105 108 113 121 134 155 189 244

We can nest functions inside each other and use mean(my_function(my_first_list))

mean(my_function(my_first_list))

## [1] 123

But this can get really hard to read, since you have to read from the inside out. In English, this nested mess reads “Calculate the mean of the results of my_function applied to my_first_list.” We can simplify this by reversing the nested chain and using the pipe operator

my_first_list %>% 
  my_function() %>% 
  mean()

## [1] 123

Here we start with the data and then describe the actions/verbs to do something to the data. We can read this chain as "Take my_first_list, pass it through my_function, and calculate the mean of that.

The %>% is called a pipe and you can also read or think of the pipe operator as the words “and then.” There’s also a keyboard shortcut for this too, since typing %>% all the time can be tedious: In Windows you would use Ctrl + Shift + M and in Mac wou would use ⌘ /Cmd+ shift + M

Similarly, we frequently need to perform a series of intermediate steps to transform data for analysis. If we write each step as a discrete command and store their contents as new objects, our code becomes difficult to read and understand.

When speaking or writing, we never start with a sentence with a verb, but rather with a noun (subject). It is good practice to start with a dataframe/object and then use verbs (or functions) to describe what you want to do.

Suppose we wanted to look at the first few rows of life expectancy values, using the head() function, of the gapminder dataframe.

# Nested command, rather hard to read, since we read from the inside out
head(select(gapminder,lifeExp))

## # A tibble: 6 x 1
##   lifeExp
##     <dbl>
## 1    28.8
## 2    30.3
## 3    32.0
## 4    34.0
## 5    36.1
## 6    38.4

# using the pipe operator: Start with gapminder, and then
gapminder %>% 
  
  # select the column (or variable) lifeExp, and then 
  select(lifeExp) %>% 
  
  # use the head() function to return the first few rows of the dataset 
  head()

## # A tibble: 6 x 1
##   lifeExp
##     <dbl>
## 1    28.8
## 2    30.3
## 3    32.0
## 4    34.0
## 5    36.1
## 6    38.4

10.2 Key functions in `dplyr`

There are 6 important verbs that you’ll typically use when working with data:

Extract columns/variables with select()
Extract rows/cases with filter()
Arrange/sort rows with arrange()
Make new columns/variables with mutate()
Make group summaries with group_by %>% summarise()

`function()`	Action performed
`select()`	Selects a subset of columns (or variables) from the data frame
`filter()`	Subsets observations based on their values
`arrange()`	Changes the order of observations based on their values
`mutate()`	Creates new columns (or variables)
`group_by()`	Changes the unit of analysis from the complete dataset to individual groups of columns
`summarise()`	Collapses the data frame to a smaller number of rows which summarise the larger data

Every dplyr verb follows the same pattern. The first argument is always a data frame, and the function always returns a data frame:

VERB(DATA_TO_TRANSFORM, STUFF_IT_DOES)

10.2.1 Pick columns with `select()`

If we want to select ], or drop, specific columns from a tibble, we use the select() verb. For instance, if we wanted to keep only the lifeExp and year columns:

gapminder %>% select(lifeExp, year)


## # A tibble: 1,704 x 2
##    lifeExp  year
##      <dbl> <int>
##  1    28.8  1952
##  2    30.3  1957
##  3    32.0  1962
##  4    34.0  1967
##  5    36.1  1972
##  6    38.4  1977
##  7    39.9  1982
##  8    40.8  1987
##  9    41.7  1992
## 10    41.8  1997
## # ... with 1,694 more rows

You can remove specific columns by prefacing the column names with a minus sign -. SO to drop -lifeExp from our tibble, we would use:

gapminder %>% select(-lifeExp)


## # A tibble: 1,704 x 5
##    country     continent  year      pop gdpPercap
##    <fct>       <fct>     <int>    <int>     <dbl>
##  1 Afghanistan Asia       1952  8425333      779.
##  2 Afghanistan Asia       1957  9240934      821.
##  3 Afghanistan Asia       1962 10267083      853.
##  4 Afghanistan Asia       1967 11537966      836.
##  5 Afghanistan Asia       1972 13079460      740.
##  6 Afghanistan Asia       1977 14880372      786.
##  7 Afghanistan Asia       1982 12881816      978.
##  8 Afghanistan Asia       1987 13867957      852.
##  9 Afghanistan Asia       1992 16317921      649.
## 10 Afghanistan Asia       1997 22227415      635.
## # ... with 1,694 more rows

You can also rename columns using select(), using the syntax select(new_name = old_name).

gapminder %>% select(year, country, life_expectancy = lifeExp)


## # A tibble: 1,704 x 3
##     year country     life_expectancy
##    <int> <fct>                 <dbl>
##  1  1952 Afghanistan            28.8
##  2  1957 Afghanistan            30.3
##  3  1962 Afghanistan            32.0
##  4  1967 Afghanistan            34.0
##  5  1972 Afghanistan            36.1
##  6  1977 Afghanistan            38.4
##  7  1982 Afghanistan            39.9
##  8  1987 Afghanistan            40.8
##  9  1992 Afghanistan            41.7
## 10  1997 Afghanistan            41.8
## # ... with 1,694 more rows

Alternatively, there’s a special rename() verb with the same syntax, i.e., rename(new_name = old_name) that will rename a column to a new name, while keeping all the other columns:

gapminder %>% rename(life_expectancy = lifeExp)


## # A tibble: 1,704 x 6
##    country     continent  year life_expectancy      pop gdpPercap
##    <fct>       <fct>     <int>           <dbl>    <int>     <dbl>
##  1 Afghanistan Asia       1952            28.8  8425333      779.
##  2 Afghanistan Asia       1957            30.3  9240934      821.
##  3 Afghanistan Asia       1962            32.0 10267083      853.
##  4 Afghanistan Asia       1967            34.0 11537966      836.
##  5 Afghanistan Asia       1972            36.1 13079460      740.
##  6 Afghanistan Asia       1977            38.4 14880372      786.
##  7 Afghanistan Asia       1982            39.9 12881816      978.
##  8 Afghanistan Asia       1987            40.8 13867957      852.
##  9 Afghanistan Asia       1992            41.7 16317921      649.
## 10 Afghanistan Asia       1997            41.8 22227415      635.
## # ... with 1,694 more rows

10.2.2 Pick rows with `filter()`

The filter() function takes two arguments: a tibble to transform, and a set of tests. It will return each row for which the test is TRUE.

This code, for instance, will look at the gapminder dataset and return all rows where country is equal to “Jordan”.

filter(gapminder, country == "Jordan")


## # A tibble: 12 x 6
##    country continent  year lifeExp     pop gdpPercap
##    <fct>   <fct>     <int>   <dbl>   <int>     <dbl>
##  1 Jordan  Asia       1952    43.2  607914     1547.
##  2 Jordan  Asia       1957    45.7  746559     1886.
##  3 Jordan  Asia       1962    48.1  933559     2348.
##  4 Jordan  Asia       1967    51.6 1255058     2742.
##  5 Jordan  Asia       1972    56.5 1613551     2111.
##  6 Jordan  Asia       1977    61.1 1937652     2852.
##  7 Jordan  Asia       1982    63.7 2347031     4161.
##  8 Jordan  Asia       1987    65.9 2820042     4449.
##  9 Jordan  Asia       1992    68.0 3867409     3432.
## 10 Jordan  Asia       1997    69.8 4526235     3645.
## 11 Jordan  Asia       2002    71.3 5307470     3845.
## 12 Jordan  Asia       2007    72.5 6053193     4519.

Notice that there are two equal signs (==). Please note that when testing for equality, we use a double equal sign, (==). If you had used a single equal sign, that would be the assignment operator, i.e., you set an argument (like data = gapminder); when you use two equal signs, you are running a logical a test.

Test	Meaning
`x < y`	Less than
`x > y`	Greater than
`x == y`	Equal to
`x <= y`	Less than or equal to
`x >= y`	Greater than or equal to
`x != y`	Not equal to
`x %in% y`	In (group membership)
`is.na(x)`	Is missing
`!is.na(x)`	Is not missing

Your turn: Use filter() and logical tests to show:
1. The data for China
2. All data for countries in Africa
3. All cases (rows) where life expectancy is greater than 80

You can also use multiple conditions, and these will extract rows that meet every test. By default, if you separate the tests with a comma, R will consider this an “and” test and find rows that are both Jordan and greater than 2000.

filter(gapminder, country == "Jordan", year > 2000)


## # A tibble: 2 x 6
##   country continent  year lifeExp     pop gdpPercap
##   <fct>   <fct>     <int>   <dbl>   <int>     <dbl>
## 1 Jordan  Asia       2002    71.3 5307470     3845.
## 2 Jordan  Asia       2007    72.5 6053193     4519.

If you have any programming experience, you can also use the common operators for “and” with “&”, “or” with “|”, and “not” with “!”:

Operator	Meaning
`a & b`	and
`a \| b`	or
`!a`	not

Your turn: Use filter() and logical tests to show:
1. India before 1970
2. Countries where life expectancy in 2007 is below 60
3. Countries where life expectancy in 2007 is below 60 and are not in Africa

Beware of some common mistakes! You can’t collapse multiple tests into one. Instead, use two separate tests:

# This won't work!
filter(gapminder, 1960 < year < 1980)

# This will work
filter(gapminder, 1960 < year, year < 1980)

Also, you can avoid stringing together lots of tests by using the %in% operator, which checks to see if a value is in a list of values.

# This works, but is tedious-- what if you wanted to pick a dozen countries?
filter(gapminder, 
       country == "Mexico" |  country == "United States" | country == "Canada" )

# This is more concise and easier to add other countries later
filter(gapminder, 
       country %in% c("Mexico", "United States", "Canada" ))

10.2.3 Sort data with `arrange()`

The arrange() verb sorts data. By default it sorts in ascending order, from minimum to maximum value:

gapminder %>% arrange(lifeExp)


## # A tibble: 1,704 x 6
##    country      continent  year lifeExp     pop gdpPercap
##    <fct>        <fct>     <int>   <dbl>   <int>     <dbl>
##  1 Rwanda       Africa     1992    23.6 7290203      737.
##  2 Afghanistan  Asia       1952    28.8 8425333      779.
##  3 Gambia       Africa     1952    30    284320      485.
##  4 Angola       Africa     1952    30.0 4232095     3521.
##  5 Sierra Leone Africa     1952    30.3 2143249      880.
##  6 Afghanistan  Asia       1957    30.3 9240934      821.
##  7 Cambodia     Asia       1977    31.2 6978607      525.
##  8 Mozambique   Africa     1952    31.3 6446316      469.
##  9 Sierra Leone Africa     1957    31.6 2295678     1004.
## 10 Burkina Faso Africa     1952    32.0 4469979      543.
## # ... with 1,694 more rows

You can sort in descending order (max to min) by using the desc() for the column/variable you want sorted:

gapminder %>% arrange(desc(lifeExp))


## # A tibble: 1,704 x 6
##    country          continent  year lifeExp       pop gdpPercap
##    <fct>            <fct>     <int>   <dbl>     <int>     <dbl>
##  1 Japan            Asia       2007    82.6 127467972    31656.
##  2 Hong Kong, China Asia       2007    82.2   6980412    39725.
##  3 Japan            Asia       2002    82   127065841    28605.
##  4 Iceland          Europe     2007    81.8    301931    36181.
##  5 Switzerland      Europe     2007    81.7   7554661    37506.
##  6 Hong Kong, China Asia       2002    81.5   6762476    30209.
##  7 Australia        Oceania    2007    81.2  20434176    34435.
##  8 Spain            Europe     2007    80.9  40448191    28821.
##  9 Sweden           Europe     2007    80.9   9031088    33860.
## 10 Israel           Asia       2007    80.7   6426679    25523.
## # ... with 1,694 more rows

You can sort by multiple columns by specifying them in a comma separated list. For example, we can sort by continent first and then sort by lifeExp (life expectancy) in descending order within each continent:

gapminder %>% 
  arrange(continent, desc(lifeExp))


## # A tibble: 1,704 x 6
##    country   continent  year lifeExp      pop gdpPercap
##    <fct>     <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Reunion   Africa     2007    76.4   798094     7670.
##  2 Reunion   Africa     2002    75.7   743981     6316.
##  3 Reunion   Africa     1997    74.8   684810     6072.
##  4 Libya     Africa     2007    74.0  6036914    12057.
##  5 Tunisia   Africa     2007    73.9 10276158     7093.
##  6 Reunion   Africa     1992    73.6   622191     6101.
##  7 Tunisia   Africa     2002    73.0  9770575     5723.
##  8 Mauritius Africa     2007    72.8  1250882    10957.
##  9 Libya     Africa     2002    72.7  5368585     9535.
## 10 Algeria   Africa     2007    72.3 33333216     6223.
## # ... with 1,694 more rows

10.2.4 Add new columns with `mutate()`

You create new columns, or variables, with the mutate() function. You can create a single new column of gdp in the gapminder tibble as follows:

mutate(gapminder, gdp = gdpPercap * pop)


## # A tibble: 1,704 x 7
##    country     continent  year lifeExp      pop gdpPercap          gdp
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>        <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.  6567086330.
##  2 Afghanistan Asia       1957    30.3  9240934      821.  7585448670.
##  3 Afghanistan Asia       1962    32.0 10267083      853.  8758855797.
##  4 Afghanistan Asia       1967    34.0 11537966      836.  9648014150.
##  5 Afghanistan Asia       1972    36.1 13079460      740.  9678553274.
##  6 Afghanistan Asia       1977    38.4 14880372      786. 11697659231.
##  7 Afghanistan Asia       1982    39.9 12881816      978. 12598563401.
##  8 Afghanistan Asia       1987    40.8 13867957      852. 11820990309.
##  9 Afghanistan Asia       1992    41.7 16317921      649. 10595901589.
## 10 Afghanistan Asia       1997    41.8 22227415      635. 14121995875.
## # ... with 1,694 more rows

And you can create multiple columns by including a comma-separated list of new columns to create:

mutate(gapminder, gdp = gdpPercap * pop,
                  pop_mill = round(pop / 1000000))


## # A tibble: 1,704 x 8
##    country     continent  year lifeExp      pop gdpPercap          gdp pop_mill
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>        <dbl>    <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.  6567086330.        8
##  2 Afghanistan Asia       1957    30.3  9240934      821.  7585448670.        9
##  3 Afghanistan Asia       1962    32.0 10267083      853.  8758855797.       10
##  4 Afghanistan Asia       1967    34.0 11537966      836.  9648014150.       12
##  5 Afghanistan Asia       1972    36.1 13079460      740.  9678553274.       13
##  6 Afghanistan Asia       1977    38.4 14880372      786. 11697659231.       15
##  7 Afghanistan Asia       1982    39.9 12881816      978. 12598563401.       13
##  8 Afghanistan Asia       1987    40.8 13867957      852. 11820990309.       14
##  9 Afghanistan Asia       1992    41.7 16317921      649. 10595901589.       16
## 10 Afghanistan Asia       1997    41.8 22227415      635. 14121995875.       22
## # ... with 1,694 more rows

You can also run logical, conditional tests within mutate() using the ifelse() function. This works like the =IF function in Excel and it takes three arguments:

a logical test,
what happens if the test is true, and
what happens if the test is false:

ifelse(TEST, VALUE_IF_TRUE, VALUE_IF_FALSE)

We can create a new column that is a binary (TRUE/FALSE) indicator for whether year is after 1960:

mutate(gapminder, after_1960 = ifelse(year > 1960, TRUE, FALSE))


## # A tibble: 1,704 x 7
##    country     continent  year lifeExp      pop gdpPercap after_1960
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl> <lgl>     
##  1 Afghanistan Asia       1952    28.8  8425333      779. FALSE     
##  2 Afghanistan Asia       1957    30.3  9240934      821. FALSE     
##  3 Afghanistan Asia       1962    32.0 10267083      853. TRUE      
##  4 Afghanistan Asia       1967    34.0 11537966      836. TRUE      
##  5 Afghanistan Asia       1972    36.1 13079460      740. TRUE      
##  6 Afghanistan Asia       1977    38.4 14880372      786. TRUE      
##  7 Afghanistan Asia       1982    39.9 12881816      978. TRUE      
##  8 Afghanistan Asia       1987    40.8 13867957      852. TRUE      
##  9 Afghanistan Asia       1992    41.7 16317921      649. TRUE      
## 10 Afghanistan Asia       1997    41.8 22227415      635. TRUE      
## # ... with 1,694 more rows

We can also use text labels instead of TRUE and FALSE:

mutate(gapminder, 
       after_1960 = ifelse(year > 1960, "After 1960", "Before 1960"))


## # A tibble: 1,704 x 7
##    country     continent  year lifeExp      pop gdpPercap after_1960 
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl> <chr>      
##  1 Afghanistan Asia       1952    28.8  8425333      779. Before 1960
##  2 Afghanistan Asia       1957    30.3  9240934      821. Before 1960
##  3 Afghanistan Asia       1962    32.0 10267083      853. After 1960 
##  4 Afghanistan Asia       1967    34.0 11537966      836. After 1960 
##  5 Afghanistan Asia       1972    36.1 13079460      740. After 1960 
##  6 Afghanistan Asia       1977    38.4 14880372      786. After 1960 
##  7 Afghanistan Asia       1982    39.9 12881816      978. After 1960 
##  8 Afghanistan Asia       1987    40.8 13867957      852. After 1960 
##  9 Afghanistan Asia       1992    41.7 16317921      649. After 1960 
## 10 Afghanistan Asia       1997    41.8 22227415      635. After 1960 
## # ... with 1,694 more rows

Your turn: Use mutate() to:
1. Add an africa column that is TRUE if the country is on the African continent
2. Add a column log_GDP for the logarithm of GDP per capita, using log(gdpPercap)
3. Add an africa_asia column that says “Africa or Asia” if the country is in Africa or Asia, and “Not Africa or Asia” if it’s not

10.2.5 Combine multiple verbs with pipes (`%>%`)

What if you want to include only rows from 2002 and make a new column with the logged GDP per capita? Doing this requires both filter() and mutate(), so we need to find a way to use both at once.

One solution is to use intermediate data frames for each step:

gapminder_2002_filtered <- filter(gapminder, year == 2002)

gapminder_2002_logged <- mutate(gapminder_2002_filtered, log_gdpPercap = log(gdpPercap))

That works fine, but your environment panel will start getting full of lots of intermediate data frames.

Another solution is to nest the functions inside each other. Remember that all dplyr functions return data frames, so you can feed the results of one into another:

filter(mutate(gapminder, log_gdpPercap = log(gdpPercap)), 
       year == 2002)

That works too, but it gets really complicated once you have even more functions, and it’s hard to keep track of which function’s arguments go where. I’d avoid doing this entirely.

One really nice solution is to use the pipe operator, or %>%. The pipe takes an object on the left and passes it as the first argument of the function on the right.

# gapminder will automatically get placed in the _____ spot
gapminder %>% filter(_____, country == "Jordan")

These two lines of code do the same thing:

filter(gapminder, country == "Jordan")

gapminder %>% filter(country == "Jordan")

Using pipes, you always start with a data frame, pass it to one verb to do one thing, then pass the output of that verb (a dataframe) to the next verb that will do something else, and so on. When reading any code with a %>%, it’s easiest to read the %>% as “and then”. This would read:

Take the gapminder dataset and then filter it so that it only has rows from 2002 and then add a new column (mutate) with the logged GDP per capita

gapminder %>% 
  filter(year == 2002) %>% 
  mutate(log_gdpPercap = log(gdpPercap))

10.2.6 Summarise data by groups with `group_by() %>% summarise()`

The summarise() verb takes an entire frame and collapses all of the rows in a single number as it calculates summary information about it. For instance, the following code will start with the entire gapminder data, calculate average life expectancy, and return just a single value, namely avarage life expectnacy among all countries and all years :

gapminder %>% summarize(mean_life = mean(lifeExp))


## # A tibble: 1 x 1
##   mean_life
##       <dbl>
## 1      59.5

You can also make multiple summary variables, just like mutate(), and it will return a column for each:

gapminder %>% summarize(mean_life = mean(lifeExp),
                        sd_life = sd(lifeExp),
                        min_life = min(lifeExp),
                        max_life = max(lifeExp)
                        )


## # A tibble: 1 x 4
##   mean_life sd_life min_life max_life
##       <dbl>   <dbl>    <dbl>    <dbl>
## 1      59.5    12.9     23.6     82.6

Your turn: Use summarise() to calculate:
1. The first (minimum) year in the gapminder dataset
2. The last (maximum) year in the dataset
3. The number of rows in the dataset (use the dplyr cheatsheet)
4. The number of distinct countries in the dataset (use the dplyr cheatsheet)
5. Use filter() and summarise() to calculate the median, minimum, and maximum life expectancy on the African continent in 2007

Again, remember that summarise() on its own summarises the entire dataset, so you only get numbers in a single row. These values can be what you want, e.g., averages, standard deviations, and min/max values for the entire dataset. If you group your data into separate subgroups with group_by(), you can use summarise() to calculate summary statistics for each group.

The group_by() function puts rows into groups based on values in a column. If you run:

gapminder %>% group_by(continent)


## # A tibble: 1,704 x 6
## # Groups:   continent [5]
##    country     continent  year lifeExp      pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.
##  2 Afghanistan Asia       1957    30.3  9240934      821.
##  3 Afghanistan Asia       1962    32.0 10267083      853.
##  4 Afghanistan Asia       1967    34.0 11537966      836.
##  5 Afghanistan Asia       1972    36.1 13079460      740.
##  6 Afghanistan Asia       1977    38.4 14880372      786.
##  7 Afghanistan Asia       1982    39.9 12881816      978.
##  8 Afghanistan Asia       1987    40.8 13867957      852.
##  9 Afghanistan Asia       1992    41.7 16317921      649.
## 10 Afghanistan Asia       1997    41.8 22227415      635.
## # ... with 1,694 more rows

…you won’t see anything different! R has put the dataset into separate invisible groups behind the scenes, but you haven’t done anything with those groups, so nothing has really happened. If you do things with those groups with summarise(), though, group_by() becomes much more useful.

For instance, this will take the gapminder data frame, group it by continent, and then summarize it by calculating the number of distinct countries in each group. It will return one row for each group, so there should be a row for each continent:

gapminder %>% 
  group_by(continent) %>% 
  summarize(n_countries = n_distinct(country))

## # A tibble: 5 x 2
##   continent n_countries
##   <fct>           <int>
## 1 Africa             52
## 2 Americas           25
## 3 Asia               33
## 4 Europe             30
## 5 Oceania             2

You can calculate multiple summary statistics, as before:

gapminder %>% 
  group_by(continent) %>% 
  summarize(n_countries = n_distinct(country),
            avg_life_exp = mean(lifeExp))

## # A tibble: 5 x 3
##   continent n_countries avg_life_exp
##   <fct>           <int>        <dbl>
## 1 Africa             52         48.9
## 2 Americas           25         64.7
## 3 Asia               33         60.1
## 4 Europe             30         71.9
## 5 Oceania             2         74.3

Your turn:
1. Calculate summary statistics for life expectancy for each continent. Calculate minimum, maximum, median, mean, and standard deviation, and total count (n)
2. Do the same, but for the year 2007 only

Finally, you can group by multiple columns and R will create subgroups for every combination of the groups and return the number of rows of combinations. For instance, we can calculate the average life expectancy by both year and continent and we’ll get 60 rows, since there are 5 continents and 12 years (5 × 12 = 60):

gapminder %>% 
  group_by(continent, year) %>% 
  summarize(avg_life_exp = mean(lifeExp))

## # A tibble: 60 x 3
## # Groups:   continent [5]
##    continent  year avg_life_exp
##    <fct>     <int>        <dbl>
##  1 Africa     1952         39.1
##  2 Africa     1957         41.3
##  3 Africa     1962         43.3
##  4 Africa     1967         45.3
##  5 Africa     1972         47.5
##  6 Africa     1977         49.6
##  7 Africa     1982         51.6
##  8 Africa     1987         53.3
##  9 Africa     1992         53.6
## 10 Africa     1997         53.6
## # ... with 50 more rows

A common mistake I have seen is that people use the summarise() function before any group_by(). Rememebr that if you summarise() first, you collapse the entire dataframe into a single row, so there is no group_by() that can be done on a single row of data!!

10.3 Further resources

dplyr and tidyr cheat sheet for examples.

This page last updated on: 2020-07-16

10. Manipulate data with dplyr

10.1 The pipe operator, or %>%

10.2 Key functions in dplyr

10.2.1 Pick columns with select()

10.2.2 Pick rows with filter()

10.2.3 Sort data with arrange()

10.2.4 Add new columns with mutate()

10.2.5 Combine multiple verbs with pipes (%>%)

10.2.6 Summarise data by groups with group_by() %>% summarise()