This section covers functions and iteration, two programming tools that automate tasks and make your R programming more efficient.

3.1 Introduction

Functions allow you to reduce code repetition and duplication by automating a task. Whenever you find yourself repeating a task or copying and pasting code, there is a good chance that you should write a function. Iteration allows you to do the same thing to multiple inputs, repeating the same operation on different variables (columns) or different datasets.

3.2 Functions

Functions let you avoid repeating and copy-pasting code, which makes your code quicker to write and easier to maintain. You have already used functions from ggplot2, tidyr, dplyr, etc. to visualise, reshape, and wrangle your data, but these functions were written for you.

In this section you will learn how to create your own functions. The main reasons to write your own functions are clarity, reusability, and error minimisation.

  • Functions are self-documenting. A good name for your function allows you to easily remember the function and its purpose. Clarity in your code creates clarity in your data analysis process, something necessary when your code will be read and interpreted by others, or by your future self a few months later.
  • Functions allow you to reuse code. If an update to the code is necessary, you revise your function definition, and the changes immediately apply to any analyses that use the function. Otherwise, without a user-defined function, your duplicated code becomes tedious to edit, and mistakes will invariably creep in during the cut-and-paste editing process.
  • Functions reduce potential errors. There are fewer chances to make mistakes because the code only exists in one location. When copying and pasting, you may forget to copy or fail to update a line in one location.

Good rule of thumb: if you have copied and pasted a block of code more than twice, it is time to convert it to a function.

3.2.1 Writing a function

Functions have three key components:

  1. A name. This should be informative and describe what the function does.
  2. The arguments, or list of inputs, to the function. They go inside the parentheses in function().
  3. The body. This is the block of code within {} that immediately follows function(...), and is the code that you developed to perform the action described in the name using the arguments you provide.

function_name <- function( argument1, argument2, ETC ) {
  statement1
  statement2
  ETC
  return( output )
}

What this does is create a function with the name function_name, which has arguments, or takes as inputs, argument1, argument2 and so forth. Whenever the function is called, R executes the statements in the curly braces, and then outputs the contents of output to the user.

3.2.2 A simple example

To give a simple example of this, let’s create a function called sum_of_squares which calculates the sum of the squares of two numbers. For instance, if we invoke our function with the numbers 3 and 4 as arguments, we would expect to get \(3^2 + 4^2 = 25\) as output.

  • Name - sum_of_squares
    • Calculates the sum of the squared value of two variables
  • Arguments
    • x - one number
    • y - a second number
  • Body
    • The first line squares x and y, adds the results together, and returns the sum.
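
The definition itself does not appear in the text above. A minimal version consistent with that description, offered as a sketch rather than the original code, could be:

```r
# Sketch reconstructed from the description above (assumption: a one-line body)
sum_of_squares <- function(x, y) {
  # square each argument, add the results, and return the sum
  x^2 + y^2
}
```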

To make sure our function works, we call it with the numbers 3 and 4:

sum_of_squares(3, 4)
## [1] 25

An important thing to recognise is that the two variables x and y that our function uses stay internal to the function; at no point does either of them get created in the workspace.

Vectorisation allows you to do cool things: this function also works with vectors of numbers.

x <- c(2, 4, 6)
y <- c(1, 3, 5)
sum_of_squares(x, y)
## [1]  5 25 61

3.2.3 Present Value (PV) function

In finance, given an opportunity cost of capital \(r\), we are able to calculate the present value (PV) today of a future cashflow (FV) that will occur \(n\) years later. So if I promised to pay you 1,000 in \(n = 5\) years’ time, and the opportunity cost of capital \(r\) were 0.08 (or 8%), how much is this worth in today’s money?
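
Written out, and assuming annual compounding (which is what the PV() function defined next computes), the calculation is:

\[
PV = \frac{FV}{(1 + r)^n} = \frac{1000}{(1 + 0.08)^5} \approx \frac{1000}{1.4693} \approx 680.58
\]

The outputs shown below display this value as 681, which appears to reflect a three-significant-digit display setting rather than a different calculation.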

First, let us define the PV function

PV <- function(FV, r, n) {
  PV <- FV/(1+r)^n
  round(PV, 2)
}

To invoke the PV() function we can pass the arguments in different ways.

First, we write out the argument names explicitly:

# using argument names
PV(FV = 1000, r = .08, n = 5)
## [1] 681

We can also invoke the function without using names. This is known as positional matching, and the arguments we pass must follow the order in which they were defined. In our case, the first argument is the future value FV, followed by the opportunity cost of capital r, and the number of years n.

# same as above but without using names (aka "positional matching")
PV(1000, .08, 5)
## [1] 681

If you are not using argument names, you must supply the arguments in the proper order.

# here the function assumes FV = .08, r = 1000, and n = 5
PV(.08, 1000, 5)
## [1] 0

If using argument names, you can change the order of the arguments; R will match them and you do not have to remember their order.

# if using names you can change the order
PV(r = .08, FV = 1000, n = 5)
## [1] 681

Again, vectorisation allows us to calculate the Present Value PV not just for one Future Value (FV), but rather a collection of future values as shown below.

FV  <-  seq(from = 1000, to = 10000, by = 1000)
FV
##  [1]  1000  2000  3000  4000  5000  6000  7000  8000  9000 10000
PV(FV, r = .08,  n = 5)
##  [1]  681 1361 2042 2722 3403 4084 4764 5445 6125 6806

3.2.4 Dealing with Invalid Arguments

For functions that will be used by someone other than their creator, it is good practice to check the validity of the arguments within the function. One way to do this is to combine an if() statement, which checks whether each argument is numeric, with the stop() function. In our case, if one or more arguments to the PV function are not numeric, stop() halts execution of the function and returns an error message to the user.

PV <- function(FV, r, n) {
  # stop early if any argument is not numeric
  if (!is.numeric(FV) || !is.numeric(r) || !is.numeric(n)) {
    stop('This function only works for numeric inputs!\n',
         'You have provided objects of the following classes:\n',
         'FV: ', class(FV), '\n',
         'r: ', class(r), '\n',
         'n: ', class(n))
  }

  PV <- FV / (1 + r)^n
  round(PV, 2)
}


PV("future value of 1,000", 0.08, "five")
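
The call above stops with an error. One way to inspect the message without halting a script is to wrap the call in tryCatch(). The sketch below repeats the PV() definition so that it runs on its own; the name msg is only illustrative.

```r
PV <- function(FV, r, n) {
  if (!is.numeric(FV) || !is.numeric(r) || !is.numeric(n)) {
    stop('This function only works for numeric inputs!\n',
         'You have provided objects of the following classes:\n',
         'FV: ', class(FV), '\n',
         'r: ', class(r), '\n',
         'n: ', class(n))
  }
  round(FV / (1 + r)^n, 2)
}

# capture the error message instead of stopping the script
msg <- tryCatch(
  PV("future value of 1,000", 0.08, "five"),
  error = function(e) conditionMessage(e)
)
cat(msg, "\n")
```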

3.2.5 A user-defined function for mapping dates to time periods

Dealing with dates and times can be challenging in R. The lubridate package makes this much easier: if you specify a date in the ISO 8601 format, where date and time values are ordered from the largest to the smallest unit of time (such as year-month-day), you can easily perform all sorts of operations, like finding the day, month, year, etc.

library(lubridate)

# use lubridate::ymd (year-month-day) function to cast a string of characters as a date
test_date <- ymd("2019-12-31")

# use lubridate::year to return the year of a given date
year(test_date)
## [1] 2019
# use lubridate::month to return the month of a given date
month(test_date)
## [1] 12
# use lubridate::day to return the day of a given date
day(test_date)
## [1] 31

However, lubridate offers no season function that would allow us to find the season (summer, winter, etc.) for a given date. We can define one ourselves; let us work through the function’s three key components:

  1. A name. In this case, our function is called season, which is informative and describes what the function does.
  2. The arguments, or list of inputs, to the function. Our function takes two inputs: timedate, the date to classify, and convention, which must be one of northern_hemisphere, southern_hemisphere, or month_initials. If we do not supply a value for convention, the function uses the default, northern_hemisphere.
  3. The body. This is the block of code that performs the mapping from a date to a season. Notice that if the value of convention is anything other than northern_hemisphere, southern_hemisphere, or month_initials, the function stops and returns an error message.

season <- function(timedate, convention = "northern_hemisphere") {
  season_terms <- switch(convention, 
                         "northern_hemisphere" = c("spring", "summer", "autumn", "winter"),
                         "southern_hemisphere" = c("autumn", "winter", "spring", "summer"),
                         "month_initials"      = c("Mar Apr May",    
                                                   "Jun Jul Aug",    
                                                   "Sep Oct Nov",    
                                                   "Dec Jan Feb"),
                         stop("Wrong value of convention")
  )
  
  m <- month(timedate)
  s <- factor(character(length(m)), levels = season_terms)
  s[m %in% c( 3,  4,  5)] <- season_terms[1] #assign to months 3-4-5, or Mar-Apr-May
  s[m %in% c( 6,  7,  8)] <- season_terms[2] #assign to months 6-7-8, or Jun-Jul-Aug
  s[m %in% c( 9, 10, 11)] <- season_terms[3] #assign to months 9-10-11, or Sep-Oct-Nov
  s[m %in% c(12,  1,  2)] <- season_terms[4] #assign to months 12-1-2, or Dec-Jan-Feb
  s
}

#check season for new year's eve in the northern hemisphere
season(test_date, "northern_hemisphere")
## [1] winter
## Levels: spring summer autumn winter
#check season for new year's eve in the southern hemisphere
season(test_date, "southern_hemisphere")
## [1] summer
## Levels: autumn winter spring summer
#check month initials for new year's eve
season(test_date, "month_initials")
## [1] Dec Jan Feb
## Levels: Mar Apr May Jun Jul Aug Sep Oct Nov Dec Jan Feb
#check whether the function stops if the argument for convention is not among the known values
season(test_date, "my_hemisphere")
## Error in season(test_date, "my_hemisphere"): Wrong value of convention

3.2.6 Conditional execution: if and if_else for conditional programming

Sometimes you only want to execute code if a condition is met. To do that, use an if-else statement.

if (condition) {
  # code executed when condition is TRUE
} else {
  # code executed when condition is FALSE
}

condition must always evaluate to either TRUE or FALSE. This is similar to filter(), except condition can only be a single value (i.e. a vector of length 1), whereas filter() works for entire vectors (or columns).
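
When the test involves a whole vector, it has to be collapsed to a single TRUE or FALSE before if() can use it. A small sketch with made-up values (the vector readings is hypothetical), using any() and all():

```r
readings <- c(3, 8, 12)

# readings > 10 is a logical vector of length 3, so it cannot go into if() directly
readings > 10

# any() and all() each reduce it to a single TRUE or FALSE
all(readings > 10)

if (any(readings > 10)) {
  status <- "at least one reading exceeds 10"
} else {
  status <- "no reading exceeds 10"
}
status
```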

You can chain as many if...else conditional statements as required. For example:

if (this) {
  # do that
} else if (that) {
  # do something else
} else {
  # do something entirely different
}


For instance, with x set to 7, the second branch runs:

x <- 7

if (x >= 10) {
  print("x exceeds acceptable tolerance levels")
} else if (x >= 0 && x < 10) {
  print("x is within acceptable tolerance levels")
} else {
  print("x is negative")
}

This can get tedious if you need to consider many conditions. There are alternatives in R for some of these long conditional statements. For instance, if you want to convert a continuous (or numeric) variable to categories, use cut(), which divides the range of x into intervals and codes the values of x according to the interval into which they fall. The leftmost interval corresponds to level one, the next leftmost to level two, and so on.

library(dplyr)      # provides %>%, select() and mutate()
library(gapminder)
gapminder %>%
  select(country, lifeExp) %>%
  mutate(lifeExp_autobin = cut(lifeExp, breaks = 5),
         lifeExp_manbin = cut(lifeExp,
                            breaks = c(0, 30, 50, 62, 75, 90),
                            labels = c("Very Short", "Short", "Average", "Long", "Very Long")))
## # A tibble: 1,704 x 4
##    country     lifeExp lifeExp_autobin lifeExp_manbin
##    <fct>         <dbl> <fct>           <fct>         
##  1 Afghanistan    28.8 (23.5,35.4]     Very Short    
##  2 Afghanistan    30.3 (23.5,35.4]     Short         
##  3 Afghanistan    32.0 (23.5,35.4]     Short         
##  4 Afghanistan    34.0 (23.5,35.4]     Short         
##  5 Afghanistan    36.1 (35.4,47.2]     Short         
##  6 Afghanistan    38.4 (35.4,47.2]     Short         
##  7 Afghanistan    39.9 (35.4,47.2]     Short         
##  8 Afghanistan    40.8 (35.4,47.2]     Short         
##  9 Afghanistan    41.7 (35.4,47.2]     Short         
## 10 Afghanistan    41.8 (35.4,47.2]     Short         
## # ... with 1,694 more rows
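
To see how values on a boundary are treated, here is a small self-contained sketch using a few made-up life expectancies (the vector ages is hypothetical). By default cut() builds intervals that are open on the left and closed on the right, so a value of exactly 30 lands in the (0, 30] bin:

```r
ages <- c(28.8, 30, 30.3, 50, 75.1)

bins <- cut(ages,
            breaks = c(0, 30, 50, 62, 75, 90),
            labels = c("Very Short", "Short", "Average", "Long", "Very Long"))

# pair each value with its bin
data.frame(ages, bins)
```

Setting right = FALSE would instead close the intervals on the left.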

3.2.6.1 if versus if_else()

Because if-else conditional statements like the ones outlined above must always resolve to a single TRUE or FALSE, they cannot be used for vector operations. Vector operations are where you make multiple comparisons simultaneously for each value stored inside a vector. Consider the gapminder data and imagine you wanted to create a new column identifying whether, for a particular country and year, life expectancy was greater than 50 years. This sounds like a classic if-else operation. For each row, if lifeExp > 50, then the value in the new column should be “Life expectancy greater than 50”. But what happens if we try to implement this using an if-else operation?

(lifeExp_if <- gapminder %>%
   mutate(over50 = if(lifeExp > 50){
     "Life expectancy greater than 50"
   } else {
     "Life expectancy less than 50"
   }))
## Warning in if (lifeExp > 50) {: the condition has length > 1 and only the first element will be used
## # A tibble: 1,704 x 7
##    country     continent  year lifeExp      pop gdpPercap over50                      
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl> <chr>                       
##  1 Afghanistan Asia       1952    28.8  8425333      779. Life expectancy less than 50
##  2 Afghanistan Asia       1957    30.3  9240934      821. Life expectancy less than 50
##  3 Afghanistan Asia       1962    32.0 10267083      853. Life expectancy less than 50
##  4 Afghanistan Asia       1967    34.0 11537966      836. Life expectancy less than 50
##  5 Afghanistan Asia       1972    36.1 13079460      740. Life expectancy less than 50
##  6 Afghanistan Asia       1977    38.4 14880372      786. Life expectancy less than 50
##  7 Afghanistan Asia       1982    39.9 12881816      978. Life expectancy less than 50
##  8 Afghanistan Asia       1987    40.8 13867957      852. Life expectancy less than 50
##  9 Afghanistan Asia       1992    41.7 16317921      649. Life expectancy less than 50
## 10 Afghanistan Asia       1997    41.8 22227415      635. Life expectancy less than 50
## # ... with 1,694 more rows

This did not work correctly. Because if() can only handle a single TRUE/FALSE value, it only checked the first row of the data frame. That row, Afghanistan in 1952, contained 28.801 for lifeExp, so it generated a vector of length 1704 with every single value being “Life expectancy less than 50”. (In R 4.2.0 and later, a condition of length greater than one is an error rather than a warning, so this code would not run at all.)

dplyr::count(lifeExp_if, over50)
## # A tibble: 1 x 2
##   over50                           n
##   <chr>                        <int>
## 1 Life expectancy less than 50  1704

If we wanted to make this if-else comparison 1704 times, we should instead use if_else(). This vectorises the if-else comparison and makes a separate comparison for each row of the data frame, which allows us to correctly generate this new column (see footnote 1).

(lifeExp_ifelse <- gapminder %>%
  mutate(over50 = if_else(lifeExp > 50, 
                          "Life expectancy greater than 50",  
                          "Life expectancy less than 50")))
## # A tibble: 1,704 x 7
##    country     continent  year lifeExp      pop gdpPercap over50                      
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl> <chr>                       
##  1 Afghanistan Asia       1952    28.8  8425333      779. Life expectancy less than 50
##  2 Afghanistan Asia       1957    30.3  9240934      821. Life expectancy less than 50
##  3 Afghanistan Asia       1962    32.0 10267083      853. Life expectancy less than 50
##  4 Afghanistan Asia       1967    34.0 11537966      836. Life expectancy less than 50
##  5 Afghanistan Asia       1972    36.1 13079460      740. Life expectancy less than 50
##  6 Afghanistan Asia       1977    38.4 14880372      786. Life expectancy less than 50
##  7 Afghanistan Asia       1982    39.9 12881816      978. Life expectancy less than 50
##  8 Afghanistan Asia       1987    40.8 13867957      852. Life expectancy less than 50
##  9 Afghanistan Asia       1992    41.7 16317921      649. Life expectancy less than 50
## 10 Afghanistan Asia       1997    41.8 22227415      635. Life expectancy less than 50
## # ... with 1,694 more rows
dplyr::count(lifeExp_ifelse, over50)
## # A tibble: 2 x 2
##   over50                              n
##   <chr>                           <int>
## 1 Life expectancy greater than 50  1213
## 2 Life expectancy less than 50      491
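
To see what if_else() does element by element, including with missing values, the sketch below uses a tiny made-up vector (life_exp is hypothetical). A missing input stays missing unless a replacement is supplied through the missing argument of if_else():

```r
library(dplyr)

life_exp <- c(40, 60, NA)

# NA in the condition gives NA in the result
over50 <- if_else(life_exp > 50, "greater than 50", "less than 50")
over50

# optionally give missing values their own label
over50_filled <- if_else(life_exp > 50, "greater than 50", "less than 50",
                         missing = "unknown")
over50_filled
```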

3.2.7 Saving and Reusing Functions

If you want to save a function to be used at another time, you can either create a package or save your function in an R script file. As an example, we can save the PV() function from the previous section as an R script file named PV.R. Then, provided the script file is in the same project or working directory, we can load the code for our function by typing source("PV.R").
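
A minimal sketch of that workflow is shown below. To keep it self-contained it writes the script to a temporary file rather than to PV.R in the working directory; the path script_path is an assumption made for illustration only.

```r
# write the function definition to a script file (a temporary one here)
script_path <- file.path(tempdir(), "PV.R")
writeLines(c(
  "PV <- function(FV, r, n) {",
  "  round(FV / (1 + r)^n, 2)",
  "}"
), script_path)

# in a later session, load the function from the file and use it
source(script_path)
PV(FV = 1000, r = 0.08, n = 5)
```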

3.3 Iteration

Iteration automates a multi-step process by organising sequences of R commands. R provides several control structures for performing repetitive tasks. These are

  • while loop to iterate until a logical condition becomes FALSE
  • for loop to iterate over a fixed number of iterations
  • repeat loop to execute until told to break
  • break/next statements to exit a loop or skip to its next iteration
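
The while and for loops are covered in the subsections that follow. As a brief sketch of the other two items, the hypothetical example below uses repeat with break to stop once a counter passes 10, and next to skip odd numbers along the way:

```r
i <- 0
evens <- c()

repeat {
  i <- i + 1
  if (i > 10) {
    break          # leave the loop entirely
  }
  if (i %% 2 == 1) {
    next           # skip the rest of this iteration for odd i
  }
  evens <- c(evens, i)
}

evens
```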

3.3.1 The while loop

A while loop is a simple thing. The basic format of the loop looks like this:

while (condition ) {
  statement_1
  statement_2
  ETC 
}

The code corresponding to condition needs to produce a logical value, either TRUE or FALSE. Whenever R encounters a while statement, it checks to see if the condition is TRUE. If it is, then R goes on to execute all of the commands inside the curly brackets, proceeding from top to bottom as usual. However, when it gets to the bottom of those statements, it moves back up to the while statement and checks if the condition is TRUE. If it is, then R continues until at some point the condition turns out to be FALSE. Once that happens, R jumps to the bottom of the loop (i.e., to the } character), and then continues on with whatever commands appear next in the script.

3.3.1.1 A loan example

Suppose you take out a loan of 20,000 at an annual interest rate of 5% and agree to pay the bank 375 every month over the next 5 years, or 60 months. Will you be able to repay your loan? We can simulate the whole process in R to see what happens.

# Specify the variables that define the problem
month <- 0          # count the number of months
balance <- 20000    # initial loan balance
payments <- 375     # amount of monthly payment
interest <- 0.05    # 5% interest rate per year
total_paid <- 0     # track what you have paid the bank

# convert annual interest to a monthly multiplier. Since we have compound interest, it's not just 5%/12, but rather
# the monthly rate x is the value that makes (1+x)^12 = 1.05
monthly_multiplier <- (1 + interest) ^ (1/12)

# keep looping until the loan is paid off...
while(balance > 0){
  
  # do the calculations for this month
  month <- month + 1  # one more month
  
  balance <- balance * monthly_multiplier # add the interest 
  balance <- balance - payments # make the payments 
  total_paid <- total_paid + payments # track the total paid
  
  # print the results on screen
  cat("month", month, ": balance", round(balance), "\n")
  
}

# print the total payments at the end
cat("total payments made", total_paid, "\n")

When we run this code, R checks the condition that balance > 0. Since the starting value of balance is 20000, the condition is TRUE, so it enters the body of the loop (inside the curly braces). The commands here instruct R to increase the value of month by 1, and to do all the book-keeping necessary for the loan, namely take the starting balance, add the interest, subtract the payment, calculate the new balance and keep track of the total amount paid. R then returns to the top of the loop, and rechecks the condition that balance > 0. This continues until the balance drops to zero or below, at which point the loop ends and the total amount paid is printed.
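
One caveat: if the monthly payment does not cover the monthly interest, the balance never falls and a loop like this would run forever. The sketch below wraps the same calculation in a function with a guard for that case; the function name months_to_repay and its max_months argument are hypothetical, introduced only for illustration.

```r
months_to_repay <- function(balance, payments, interest, max_months = 1200) {
  monthly_multiplier <- (1 + interest)^(1 / 12)
  month <- 0

  while (balance > 0) {
    # guard: give up rather than loop forever
    if (month >= max_months) {
      stop("Loan not repaid within ", max_months, " months")
    }
    month <- month + 1
    balance <- balance * monthly_multiplier - payments  # add interest, then pay
  }

  month
}

months_to_repay(balance = 20000, payments = 375, interest = 0.05)
```

With the figures from the example this returns 61, so the loan takes one month longer than the 60 months planned.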

3.3.2 The for loop

The for loop is also pretty simple, though not quite as simple as the while loop. The basic format of this loop goes like this:

for ( var in vector ) {
  statement_1
  statement_2
  ETC 
}

In a for loop, R runs a fixed number of iterations. We have a vector which has several elements, each one corresponding to a possible value of the variable var. In the first iteration of the loop, var is given a value corresponding to the first element of vector; in the second iteration of the loop var gets a value corresponding to the second value in vector; and so on. Once we have exhausted all of the values in the vector, the loop terminates and the flow of the program continues.

Let’s say we want to print out the first ten letters of the English alphabet. We want to start with the first letter, then the second one, and so on; in other words, we want to run print(i) for every element i of letters[1:10], where letters is R's built-in vector of lowercase letters. Because we have a fixed set of values to loop over, this situation is well suited to a for loop. Here’s the code:

for(i in letters[1:10]){
  print(i)
}
## [1] "a"
## [1] "b"
## [1] "c"
## [1] "d"
## [1] "e"
## [1] "f"
## [1] "g"
## [1] "h"
## [1] "i"
## [1] "j"

The intuition here is that R starts by setting i to the first element of letters[1:10], namely "a", and prints it. It then moves back to the top of the loop, sets i to the next element, "b", and repeats. It keeps doing this until it has printed the tenth element, "j", and then it stops.

3.3.3 Example for loop

Let us create a dataframe with three variables (columns), each containing 100 random numbers drawn from the standard Normal distribution N(0,1), that is, with mean = 0 and standard deviation = 1.

library(tibble)   # provides tibble()
df <- tibble(
  a = rnorm(100),
  b = rnorm(100),
  c = rnorm(100)
)

We can compute the median for each column, using base R:

median(df$a)
## [1] -0.0333
median(df$b)
## [1] -0.104
median(df$c)
## [1] -0.15

We’ve copied-and-pasted median() three times, but we could instead use a for loop:

output <- vector(mode = "double", length = ncol(df))
for (i in seq_along(df)) {
  output[[i]] <- median(df[[i]])
}
output
## [1] -0.0333 -0.1040 -0.1498
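
The same result can also be obtained without writing the loop by hand. As a sketch, base R's vapply() applies a function to each column and checks that every result is a single number; the data frame df_demo and the seed below are made up so that the example is reproducible.

```r
set.seed(1)
df_demo <- data.frame(a = rnorm(100), b = rnorm(100), c = rnorm(100))

# loop version, as above
output <- vector(mode = "double", length = ncol(df_demo))
for (i in seq_along(df_demo)) {
  output[[i]] <- median(df_demo[[i]])
}

# vapply version: one call, and the result keeps the column names
medians <- vapply(df_demo, median, FUN.VALUE = numeric(1))

all.equal(output, unname(medians))
```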

3.4 dplyr::mutate

To edit or add columns to a dataframe, it is often easier to use the mutate function from the dplyr package than to write a loop. Suppose we want to create a new column GDP in the gapminder dataframe, expressed in billions of US$. Given that gapminder has data on the population pop and GDP per capita gdpPercap, we can get a column with the total GDP as shown below, without creating a for loop:

library(dplyr)
gapminder %>% 
  mutate (GDP = pop * gdpPercap / 1000000000)
## # A tibble: 1,704 x 7
##    country     continent  year lifeExp      pop gdpPercap   GDP
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl> <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.  6.57
##  2 Afghanistan Asia       1957    30.3  9240934      821.  7.59
##  3 Afghanistan Asia       1962    32.0 10267083      853.  8.76
##  4 Afghanistan Asia       1967    34.0 11537966      836.  9.65
##  5 Afghanistan Asia       1972    36.1 13079460      740.  9.68
##  6 Afghanistan Asia       1977    38.4 14880372      786. 11.7 
##  7 Afghanistan Asia       1982    39.9 12881816      978. 12.6 
##  8 Afghanistan Asia       1987    40.8 13867957      852. 11.8 
##  9 Afghanistan Asia       1992    41.7 16317921      649. 10.6 
## 10 Afghanistan Asia       1997    41.8 22227415      635. 14.1 
## # ... with 1,694 more rows

3.5 Scoped verbs

Frequently when working with data frames you may wish to apply a specific function to multiple columns, for instance, calculating the average value of each column in mtcars. If we wanted to calculate the average of a single column, it would be pretty simple using just the dplyr::summarise function:

mtcars %>%
  summarise(mpg = mean(mpg))
##    mpg
## 1 20.1

If we want to calculate the mean for all variables, we’d have to duplicate this code many times:

mtcars %>%
  summarise(mpg = mean(mpg),
            cyl = mean(cyl),
            disp = mean(disp),
            hp = mean(hp),
            drat = mean(drat),
            wt = mean(wt),
            qsec = mean(qsec),
            vs = mean(vs),
            am = mean(am),
            gear = mean(gear),
            carb = mean(carb))
##    mpg  cyl disp  hp drat   wt qsec    vs    am gear carb
## 1 20.1 6.19  231 147  3.6 3.22 17.8 0.438 0.406 3.69 2.81

This is very repetitive (and boring!), but more importantly it is prone to mistakes. We can use loops and map() functions to find means for all variables, but we can also do it using scoped verbs.

Scoped verbs allow you to use standard verbs (or functions) in dplyr that affect multiple variables at once, combining both elements of repetition and (in some cases) conditional expressions:

  • _if allows you to pick variables based on a predicate function like is.numeric() or is.character()
  • _at allows you to pick variables using the same syntax as select()
  • _all operates on all variables
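
Note that since dplyr 1.0.0 the scoped verbs are superseded by across(), although they continue to work. As a rough guide, assuming dplyr 1.0.0 or later is installed, the three variants correspond to the following calls (the object names are illustrative):

```r
library(dplyr)

# _all: every column
all_means <- mtcars %>% summarise(across(everything(), mean))

# _at: columns chosen with select()-style syntax (here, all but mpg)
at_means <- mtcars %>% summarise(across(-mpg, mean))

# _if: columns chosen by a predicate function
if_means <- iris %>% summarise(across(where(is.numeric), mean))

# several summaries at once: a named list gives names like mpg_min, mpg_max
min_max <- mtcars %>% summarise(across(c(mpg, wt), list(min = min, max = max)))
names(min_max)
```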

3.5.1 summarise

3.5.1.1 summarise_all()

summarise_all() takes a dataframe and a function and applies that function to each column:

summarise_all(mtcars, .funs = mean)
##    mpg  cyl disp  hp drat   wt qsec    vs    am gear carb
## 1 20.1 6.19  231 147  3.6 3.22 17.8 0.438 0.406 3.69 2.81

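If you want to apply multiple summaries, use the funs() helper (deprecated since dplyr 0.8.0, which recommends a named list such as list(min = min, max = max) instead). In this case, we want to get the min/max values, the mean and the standard deviation (sd) for all variables: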

summarise_all(mtcars, .funs = funs(min, max, mean, sd))
##   mpg_min cyl_min disp_min hp_min drat_min wt_min qsec_min vs_min am_min gear_min carb_min mpg_max cyl_max disp_max hp_max drat_max
## 1    10.4       4     71.1     52     2.76   1.51     14.5      0      0        3        1    33.9       8      472    335     4.93
##   wt_max qsec_max vs_max am_max gear_max carb_max mpg_mean cyl_mean disp_mean hp_mean drat_mean wt_mean qsec_mean vs_mean am_mean
## 1   5.42     22.9      1      1        5        8     20.1     6.19       231     147       3.6    3.22      17.8   0.438   0.406
##   gear_mean carb_mean mpg_sd cyl_sd disp_sd hp_sd drat_sd wt_sd qsec_sd vs_sd am_sd gear_sd carb_sd
## 1      3.69      2.81   6.03   1.79     124  68.6   0.535 0.978    1.79 0.504 0.499   0.738    1.62

You can combine this with group_by() to calculate group-level summary statistics:

mtcars %>%
  group_by(gear) %>%
  summarise_all(.funs = mean)
## # A tibble: 3 x 11
##    gear   mpg   cyl  disp    hp  drat    wt  qsec    vs    am  carb
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1     3  16.1  7.47  326. 176.   3.13  3.89  17.7 0.2   0      2.67
## 2     4  24.5  4.67  123.  89.5  4.04  2.62  19.0 0.833 0.667  2.33
## 3     5  21.4  6     202. 196.   3.92  2.63  15.6 0.2   1      4.4

3.5.1.2 summarise_at()

summarise_at() allows you to pick columns in the same way as select(), that is, based on their names. There is one small difference: you need to wrap the complete selection with the vars() helper (this avoids ambiguity). In the following example, we calculate the mean for all variables, but exclude mpg.

summarise_at(mtcars, .vars = vars(-mpg), .funs = mean)
##    cyl disp  hp drat   wt qsec    vs    am gear carb
## 1 6.19  231 147  3.6 3.22 17.8 0.438 0.406 3.69 2.81

By default, the newly created columns have the shortest names needed to uniquely identify the output.

# calculate min/max for just the mpg variable
summarise_at(mtcars, .vars = vars(mpg), .funs = funs(min, max))
##    min  max
## 1 10.4 33.9
# calculate min for just the mpg and wt variables
summarise_at(mtcars, .vars = vars(mpg, wt), .funs = min)
##    mpg   wt
## 1 10.4 1.51
# calculate min/max for all but the mpg variable
summarise_at(mtcars, .vars = vars(-mpg), .funs = funs(min, max))
##   cyl_min disp_min hp_min drat_min wt_min qsec_min vs_min am_min gear_min carb_min cyl_max disp_max hp_max drat_max wt_max qsec_max
## 1       4     71.1     52     2.76   1.51     14.5      0      0        3        1       8      472    335     4.93   5.42     22.9
##   vs_max am_max gear_max carb_max
## 1      1      1        5        8

3.5.1.3 summarise_if()

summarise_at() allows you to pick variables to summarize based on their name, whereas summarise_if() allows you to pick variables to summarize based on some property of the column. Typically this is their type because you want to (e.g.) apply a numeric summary function only to numeric columns:

# group gapminder by country and calculate the mean for numeric variables
gapminder %>%
  group_by(country) %>%
  summarise_if(.predicate = is.numeric, .funs = mean, na.rm = TRUE)
## # A tibble: 142 x 5
##    country      year lifeExp       pop gdpPercap
##    <fct>       <dbl>   <dbl>     <dbl>     <dbl>
##  1 Afghanistan 1980.    37.5 15823715.      803.
##  2 Albania     1980.    68.4  2580249.     3255.
##  3 Algeria     1980.    59.0 19875406.     4426.
##  4 Angola      1980.    37.9  7309390.     3607.
##  5 Argentina   1980.    69.1 28602240.     8956.
##  6 Australia   1980.    74.7 14649312.    19981.
##  7 Austria     1980.    73.1  7583298.    20412.
##  8 Bahrain     1980.    65.6   373913.    18078.
##  9 Bangladesh  1980.    49.8 90755395.      818.
## 10 Belgium     1980.    73.6  9725119.    19901.
## # ... with 132 more rows
# group gapminder by continent and calculate min, max, and mean for numeric variables
gapminder %>%
  group_by(continent) %>%
  summarise_if(.predicate = is.numeric, .funs = funs(min, max, mean), na.rm = TRUE)
## # A tibble: 5 x 13
##   continent year_min lifeExp_min pop_min gdpPercap_min year_max lifeExp_max pop_max gdpPercap_max year_mean lifeExp_mean pop_mean
##   <fct>        <int>       <dbl>   <int>         <dbl>    <int>       <dbl>   <int>         <dbl>     <dbl>        <dbl>    <dbl>
## 1 Africa        1952        23.6   60011          241.     2007        76.4  1.35e8        21951.     1980.         48.9   9.92e6
## 2 Americas      1952        37.6  662850         1202.     2007        80.7  3.01e8        42952.     1980.         64.7   2.45e7
## 3 Asia          1952        28.8  120447          331      2007        82.6  1.32e9       113523.     1980.         60.1   7.70e7
## 4 Europe        1952        43.6  147962          974.     2007        81.8  8.24e7        49357.     1980.         71.9   1.72e7
## 5 Oceania       1952        69.1 1994794        10040.     2007        81.2  2.04e7        34435.     1980.         74.3   8.87e6
## # ... with 1 more variable: gdpPercap_mean <dbl>

(Note that na.rm = TRUE is passed on to mean() in the same way as in purrr::map().)

3.5.2 Mutate

mutate_all(), mutate_if() and mutate_at() work in a similar way to their summarise equivalents.

As an example, say we had a dataframe with various columns in 2 groups: the first group contains weights weight1, weight2, weight3, etc., and the second group contains the actual data data1 and data2.

If we wanted to cross multiply weights * data, rather than writing out for loops, we can use mutate_at to create any combination of columns of the two groups.

df <- tibble(
  # weights: 100 values, from a uniform distribution between 0 and 1 
  weight1 = runif(100, min = 0, max = 1), 
  weight2 = runif(100, min = 0, max = 1),
  weight3 = runif(100, min = 0, max = 1),
  weight4 = runif(100, min = 0, max = 1),
  
  # data 1: 100 values, from a normal distribution with mean = 100 and sd=20 
  data1 = rnorm(100, mean = 100, sd = 20), 
  
  # data 2: 100 values, from a normal distribution with mean = 250 and sd=90 
  data2 = rnorm(100, mean = 250, sd = 90)
)

df <- df %>%
  mutate_at(. , vars(starts_with("weight")), funs(prod1 = . * data1, prod2 = . * data2))
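
The sketch below writes the same idea with across(), which supersedes mutate_at() in dplyr 1.0.0 and later, on a made-up three-row tibble (small_df is hypothetical). With a named list of functions, the new columns are named weight1_prod1, weight1_prod2, and so on:

```r
library(dplyr)

small_df <- tibble(
  weight1 = c(0.1, 0.5, 0.9),
  weight2 = c(0.2, 0.4, 0.6),
  data1   = c(100, 110, 120),
  data2   = c(250, 300, 350)
)

# multiply every weight column by data1 and by data2
small_df <- small_df %>%
  mutate(across(starts_with("weight"),
                list(prod1 = ~ .x * data1, prod2 = ~ .x * data2)))

names(small_df)
```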

3.5.3 Filter

filter_all() is the most useful of the three filter() variants. You use it in conjunction with all_vars() or any_vars() depending on whether or not you want rows where all variables meet the criterion, or where just one variable meets it.

It’s particularly useful for finding missing values:

library(nycflights13)

# Rows where any value is missing
filter_all(weather, .vars_predicate = any_vars(is.na(.)))
## # A tibble: 21,135 x 15
##    origin  year month   day  hour  temp  dewp humid wind_dir wind_speed wind_gust precip pressure visib time_hour          
##    <chr>  <int> <int> <int> <int> <dbl> <dbl> <dbl>    <dbl>      <dbl>     <dbl>  <dbl>    <dbl> <dbl> <dttm>             
##  1 EWR     2013     1     1     1  39.0  26.1  59.4      270      10.4         NA      0    1012     10 2013-01-01 01:00:00
##  2 EWR     2013     1     1     2  39.0  27.0  61.6      250       8.06        NA      0    1012.    10 2013-01-01 02:00:00
##  3 EWR     2013     1     1     3  39.0  28.0  64.4      240      11.5         NA      0    1012.    10 2013-01-01 03:00:00
##  4 EWR     2013     1     1     4  39.9  28.0  62.2      250      12.7         NA      0    1012.    10 2013-01-01 04:00:00
##  5 EWR     2013     1     1     5  39.0  28.0  64.4      260      12.7         NA      0    1012.    10 2013-01-01 05:00:00
##  6 EWR     2013     1     1     6  37.9  28.0  67.2      240      11.5         NA      0    1012.    10 2013-01-01 06:00:00
##  7 EWR     2013     1     1     7  39.0  28.0  64.4      240      15.0         NA      0    1012.    10 2013-01-01 07:00:00
##  8 EWR     2013     1     1     8  39.9  28.0  62.2      250      10.4         NA      0    1012.    10 2013-01-01 08:00:00
##  9 EWR     2013     1     1     9  39.9  28.0  62.2      260      15.0         NA      0    1013.    10 2013-01-01 09:00:00
## 10 EWR     2013     1     1    10  41    28.0  59.6      260      13.8         NA      0    1012.    10 2013-01-01 10:00:00
## # ... with 21,125 more rows
# Rows where all wind variables are missing
filter_at(weather, .vars = vars(starts_with("wind")),
          .vars_predicate = all_vars(is.na(.)))
## # A tibble: 4 x 15
##   origin  year month   day  hour  temp  dewp humid wind_dir wind_speed wind_gust precip pressure visib time_hour          
##   <chr>  <int> <int> <int> <int> <dbl> <dbl> <dbl>    <dbl>      <dbl>     <dbl>  <dbl>    <dbl> <dbl> <dttm>             
## 1 EWR     2013     3    27    17  52.0  19.0  27.0       NA         NA        NA      0    1012.  10   2013-03-27 17:00:00
## 2 JFK     2013     5    22    10  62.1  59    93.8       NA         NA        NA      0      NA    2.5 2013-05-22 10:00:00
## 3 JFK     2013     7     4     6  73.0  71.1  93.5       NA         NA        NA      0    1024.   6   2013-07-04 06:00:00
## 4 JFK     2013     7    20     6  81.0  71.1  71.9       NA         NA        NA      0    1010.  10   2013-07-20 06:00:00
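
In dplyr 1.0.4 and later, the same filters can be written with if_any() and if_all() inside a plain filter(). A sketch on a small made-up tibble (obs is hypothetical), assuming such a dplyr version is installed:

```r
library(dplyr)

obs <- tibble(
  station    = c("A", "B", "C"),
  wind_dir   = c(270, NA, NA),
  wind_speed = c(10.4, NA, 8.1)
)

# rows where any value is missing
any_missing <- obs %>% filter(if_any(everything(), is.na))

# rows where all wind variables are missing
all_wind_missing <- obs %>% filter(if_all(starts_with("wind"), is.na))
```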

3.7 Acknowledgements



This page last updated on: 2020-07-14


  1. Notice that it also preserves missing values in the new column. Remember, any operation performed on a missing value will itself become a missing value.