In this section we talk about functions and iterations, programming tools to automate tasks and make you more efficient with your R programming.
Functions allow you to reduce code repetition and duplication by automating a task. Whenever you find yourself repeating a task or copying and pasting code, there is a good chance that you should write a function. Iterations allow you to do the same thing to multiple inputs, repeating the same operation on different variables (columns) or different datasets.
You have already used functions from ggplot2, tidyr, dplyr, etc. to visualise, reshape, and wrangle your data, but these functions were written for you.
In this section you will learn how to create your own functions. The main reasons to write your own functions are clarity, reusability, and error minimisation.
Good rule of thumb: if you have copied and pasted a block of code more than twice, it is time to convert it to a function.
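Functions have three key components:
The name, which should be informative and describe what the function does.
The arguments, or inputs, which go inside the parentheses of function().
The body, which is the block of code within {} that immediately follows function(...), and is the code that you developed to perform the action described in the name using the arguments you provide.
function_name <- function( argument1, argument2, ETC ) {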
statement1
statement2
ETC
return( output )
}
What this does is create a function with the name function_name, which has arguments, or takes as inputs, argument1, argument2 and so forth. Whenever the function is called, R executes the statements in the curly braces, and then outputs the contents of output to the user.
To give a simple example of this, let’s create a function called sum_of_squares which calculates the sum of the squares of two numbers. For instance, if we invoke our function with the numbers 3 and 4 as arguments, we would expect to get \(3^2 + 4^2 = 25\) as output.
Name: sum_of_squares
Arguments: x - one number, y - a second number
Body: squares x and y, adds the results together, and returns the sum.
To make sure our function works, we call it with 3 and 4:
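The listing for this function is not reproduced here. A minimal sketch that matches the description above (the exact body is an assumption) could look like this:

```r
# sum_of_squares(): body reconstructed from the description, not the original listing
sum_of_squares <- function(x, y) {
  # square each input, then add the results together
  output <- x^2 + y^2
  return(output)
}

# check the function with the numbers 3 and 4
sum_of_squares(3, 4)
```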
## [1] 25
An important thing to recognise is that the two internal variables x and y that our function uses stay internal to the function, and at no point do either of these variables get created in the workspace.
Vectorisation allows you to do cool things: this function also works with vectors of numbers.
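The call behind the output below is not shown. One call that is consistent with it, assuming the input vectors were 1, 3, 5 and 2, 4, 6, is sketched here:

```r
# sum_of_squares() repeated so that this snippet runs on its own
sum_of_squares <- function(x, y) {
  x^2 + y^2
}

# hypothetical inputs: the function is applied element by element
sum_of_squares(c(1, 3, 5), c(2, 4, 6))
```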
## [1] 5 25 61
In finance, given an opportunity cost of capital \(r\), we are able to calculate the present value (PV) today of a future cashflow (FV) that will occur \(n\) years later. So if I promised to pay you 1,000 in \(n = 5\) years’ time, and the opportunity cost of capital \(r\) were 0.08 (or 8%), how much is this worth in today’s money?
First, let us define the PV function.
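The listing for the definition is not reproduced here. A minimal sketch, consistent with the discounting formula \(PV = FV / (1 + r)^n\) and with the validated version of PV() given later in this section, might be:

```r
# present value of a future cashflow FV, received n years from now,
# discounted at an opportunity cost of capital r
PV <- function(FV, r, n) {
  PV <- FV / (1 + r)^n
  round(PV, 2)  # round to two decimal places, as in the later version
}
```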
To invoke the PV() function we can pass the arguments in different ways.
First, we write out the argument names explicitly
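The call itself is not shown. Assuming the values from the example (1,000 received in 5 years, discounted at 8%), it would presumably look like this:

```r
# PV() as sketched for this section, repeated so the snippet runs on its own
PV <- function(FV, r, n) round(FV / (1 + r)^n, 2)

# every argument is named explicitly
PV(FV = 1000, r = 0.08, n = 5)
```

With default print settings this call prints 680.58; the output shown in this section appears to have been printed to three significant digits.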
## [1] 681
We can also invoke the function without using names. This is known as positional matching and the arguments we pass should match the order in which they were defined. In our case, the first argument is the future value FV, followed by the opportunity cost of capital r, and the number of years n.
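A positional call consistent with the text (the listing is not shown, so the exact call is an assumption) could be:

```r
# PV() as sketched for this section, repeated so the snippet runs on its own
PV <- function(FV, r, n) round(FV / (1 + r)^n, 2)

# arguments are matched by position: FV first, then r, then n
PV(1000, 0.08, 5)
```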
## [1] 681
If you are not using argument names, you must supply the arguments in the proper order; otherwise R computes a different quantity without any warning
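To illustrate, here is a sketch that assumes the future value and the number of years were swapped, which is one mistake that would give the result shown below:

```r
# PV() as sketched for this section, repeated so the snippet runs on its own
PV <- function(FV, r, n) round(FV / (1 + r)^n, 2)

# 5 is matched to FV and 1000 to n, so we discount 5 over 1000 years
PV(5, 0.08, 1000)
```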
## [1] 0
If using argument names, you can change the order of the arguments; R will match them and you do not have to remember their order.
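A sketch of a named call with the arguments in a different order (the particular ordering is an assumption):

```r
# PV() as sketched for this section, repeated so the snippet runs on its own
PV <- function(FV, r, n) round(FV / (1 + r)^n, 2)

# names let R match the arguments regardless of their order
PV(r = 0.08, FV = 1000, n = 5)
```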
## [1] 681
Again, vectorisation allows us to calculate the Present Value PV not just for one Future Value (FV), but rather a collection of future values as shown below.
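The code for this step is not shown. A sketch consistent with the two outputs below, where the variable name future_values is hypothetical:

```r
# PV() as sketched for this section, repeated so the snippet runs on its own
PV <- function(FV, r, n) round(FV / (1 + r)^n, 2)

# ten future values, from 1,000 to 10,000 in steps of 1,000
future_values <- seq(from = 1000, to = 10000, by = 1000)
future_values

# one present value per future value, all discounted at 8% over 5 years
PV(future_values, r = 0.08, n = 5)
```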
## [1] 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000
## [1] 681 1361 2042 2722 3403 4084 4764 5445 6125 6806
For functions that will be used by someone other than the creator of the function, it is good to check the validity of arguments within the function. One way to do this is to use an if() statement to check whether each argument is numeric, and to call the stop() function when the check fails. In our case, if one or more arguments to the PV function are not numeric then the stop() function will stop execution of the current function and provide an error message to the user.
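PV <- function(FV, r, n) {
  # stop with an informative message if any argument is not numeric
  if (!is.numeric(FV) || !is.numeric(r) || !is.numeric(n)) {
    stop('This function only works for numeric inputs!\n',
         'You have provided objects of the following classes:\n',
         'FV: ', class(FV), '\n',
         'r: ', class(r), '\n',
         'n: ', class(n))
  }
  # discount the future value back n periods at rate r
  PV <- FV / (1 + r)^n
  round(PV, 2)
}
# FV and n are character strings here, so this call stops with the error message
PV("future value of 1,000", 0.08, "five")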
Dealing with dates and times can be challenging in R; however, there is a package called lubridate which does a wonderful job of handling dates. If you specify a date in the ISO 8601 format, where date and time values are ordered from the largest to smallest unit of time, such as year-month-day, you can easily handle all sorts of operations, like finding the day, month, year, etc.
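library(lubridate)
# use lubridate::ymd (year-month-day) function to cast a string of characters as a date
test_date <- ymd("2019-12-31")
# use lubridate::year to return the year of a given date
year(test_date)
## [1] 2019
# use lubridate::month to return the month of a given date
month(test_date)
## [1] 12
# use lubridate::day to return the day of the month of a given date
day(test_date)
## [1] 31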
However, lubridate offers no season function that would allow us to find the season (summer, winter, etc.) given a date. We can define one ourselves; let us work through the function’s three key components:
The name is season, which is informative and describes what the function does.
The arguments are timedate and convention. The convention argument takes one of three values, namely northern_hemisphere, southern_hemisphere, or month_initials. If we do not supply a value for convention, the function assumes a default value of northern_hemisphere.
The body assigns a season to each month. If convention is anything other than northern_hemisphere, southern_hemisphere, or month_initials, the function stops and returns an error message.
season <- function(timedate, convention = "northern_hemisphere") {
season_terms <- switch(convention,
"northern_hemisphere" = c("spring", "summer", "autumn", "winter"),
"southern_hemisphere" = c("autumn", "winter", "spring", "summer"),
"month_initials" = c("Mar Apr May",
"Jun Jul Aug",
"Sep Oct Nov",
"Dec Jan Feb"),
stop("Wrong value of convention")
)
m <- month(timedate)
s <- factor(character(length(m)), levels = season_terms)
s[m %in% c( 3, 4, 5)] <- season_terms[1] #assign to months 3-4-5, or Mar-Apr-May
s[m %in% c( 6, 7, 8)] <- season_terms[2] #assign to months 6-7-8, or Jun-Jul-Aug
s[m %in% c( 9, 10, 11)] <- season_terms[3] #assign to months 9-10-11, or Sep-Oct-Nov
s[m %in% c(12, 1, 2)] <- season_terms[4] #assign to months 12-1-2, or Dec-Jan-Feb
s
}
#check season for new year's eve in the northern hemisphere
season(test_date, "northern_hemisphere")
## [1] winter
## Levels: spring summer autumn winter
#check season for new year's eve in the southern hemisphere
season(test_date, "southern_hemisphere")
## [1] summer
## Levels: autumn winter spring summer
#check season for new year's eve using month initials
season(test_date, "month_initials")
## [1] Dec Jan Feb
## Levels: Mar Apr May Jun Jul Aug Sep Oct Nov Dec Jan Feb
#check whether the function stops if the argument for convention is not among the known values
season(test_date, "my_hemisphere")
## Error in season(test_date, "my_hemisphere"): Wrong value of convention
if and if_else for conditional programming
Sometimes you only want to execute code if a condition is met. To do that, use an if-else statement.
if (condition) {
# code executed when condition is TRUE
} else {
# code executed when condition is FALSE
}
condition must always evaluate to either TRUE or FALSE. This is similar to filter(), except condition can only be a single value (i.e. a vector of length 1), whereas filter() works for entire vectors (or columns).
You can nest as many if...else conditional statements as required. For example:
if (this) {
# do that
} else if (that) {
# do something else
} else {
# do something entirely different
}
x <- 7
if(x >= 10){
print("x exceeds acceptable tolerance levels")
} else if(x >= 0 & x < 10){
print("x is within acceptable tolerance levels")
} else {
print("x is negative")
}
This can get tedious if you need to consider many conditions. There are alternatives in R for some of these long conditional statements. For instance, if you want to convert a continuous (or numeric) variable to categories, use cut(), which divides the range of x into intervals and codes the values of x according to the interval in which they fall. The leftmost interval corresponds to level one, the next leftmost to level two and so on.
library(dplyr)      # provides %>%, select() and mutate()
library(gapminder)
gapminder %>%
select(country, lifeExp) %>%
mutate(lifeExp_autobin = cut(lifeExp, breaks = 5),
lifeExp_manbin = cut(lifeExp,
breaks = c(0, 30, 50, 62, 75, 90),
labels = c("Very Short", "Short", "Average", "Long", "Very Long")))
## # A tibble: 1,704 x 4
## country lifeExp lifeExp_autobin lifeExp_manbin
## <fct> <dbl> <fct> <fct>
## 1 Afghanistan 28.8 (23.5,35.4] Very Short
## 2 Afghanistan 30.3 (23.5,35.4] Short
## 3 Afghanistan 32.0 (23.5,35.4] Short
## 4 Afghanistan 34.0 (23.5,35.4] Short
## 5 Afghanistan 36.1 (35.4,47.2] Short
## 6 Afghanistan 38.4 (35.4,47.2] Short
## 7 Afghanistan 39.9 (35.4,47.2] Short
## 8 Afghanistan 40.8 (35.4,47.2] Short
## 9 Afghanistan 41.7 (35.4,47.2] Short
## 10 Afghanistan 41.8 (35.4,47.2] Short
## # ... with 1,694 more rows
if versus if_else()
Because if-else conditional statements like the ones outlined above must always resolve to a single TRUE or FALSE, they cannot be used for vector operations. Vector operations are where you make multiple comparisons simultaneously for each value stored inside a vector. Consider the gapminder data and imagine you wanted to create a new column identifying whether, for a particular country and year, life expectancy was greater than 50 years. This sounds like a classic if-else operation. For each country and year, if lifeExp > 50, then the value in the new column should be “Life expectancy greater than 50”. But what happens if we try to implement this using an if-else operation?
(lifeExp_if <- gapminder %>%
mutate(over50 = if(lifeExp > 50){
"Life expectancy greater than 50"
} else {
"Life expectancy less than 50"
}))
## Warning in if (lifeExp > 50) {: the condition has length > 1 and only the first element will be used
## # A tibble: 1,704 x 7
## country continent year lifeExp pop gdpPercap over50
## <fct> <fct> <int> <dbl> <int> <dbl> <chr>
## 1 Afghanistan Asia 1952 28.8 8425333 779. Life expectancy less than 50
## 2 Afghanistan Asia 1957 30.3 9240934 821. Life expectancy less than 50
## 3 Afghanistan Asia 1962 32.0 10267083 853. Life expectancy less than 50
## 4 Afghanistan Asia 1967 34.0 11537966 836. Life expectancy less than 50
## 5 Afghanistan Asia 1972 36.1 13079460 740. Life expectancy less than 50
## 6 Afghanistan Asia 1977 38.4 14880372 786. Life expectancy less than 50
## 7 Afghanistan Asia 1982 39.9 12881816 978. Life expectancy less than 50
## 8 Afghanistan Asia 1987 40.8 13867957 852. Life expectancy less than 50
## 9 Afghanistan Asia 1992 41.7 16317921 649. Life expectancy less than 50
## 10 Afghanistan Asia 1997 41.8 22227415 635. Life expectancy less than 50
## # ... with 1,694 more rows
This did not work correctly. Because if() can only handle a single TRUE/FALSE value, it only checked the first row of the data frame. That row, Afghanistan in 1952, contained 28.801 for lifeExp, so it generated a vector of length 1704 with every single value being “Life expectancy less than 50”. (In R 4.2.0 and later, a condition of length greater than one is an error rather than a warning, so this code stops instead of running.)
# count how many rows fall into each category
lifeExp_if %>%
  count(over50)
## # A tibble: 1 x 2
## over50 n
## <chr> <int>
## 1 Life expectancy less than 50 1704
If we wanted to make this if-else comparison 1704 times, we should instead use if_else(). This vectorizes the if-else comparison and makes a separate comparison for each row of the data frame. This allows us to correctly generate this new column.1
(lifeExp_ifelse <- gapminder %>%
mutate(over50 = if_else(lifeExp > 50,
"Life expectancy greater than 50",
"Life expectancy less than 50")))
## # A tibble: 1,704 x 7
## country continent year lifeExp pop gdpPercap over50
## <fct> <fct> <int> <dbl> <int> <dbl> <chr>
## 1 Afghanistan Asia 1952 28.8 8425333 779. Life expectancy less than 50
## 2 Afghanistan Asia 1957 30.3 9240934 821. Life expectancy less than 50
## 3 Afghanistan Asia 1962 32.0 10267083 853. Life expectancy less than 50
## 4 Afghanistan Asia 1967 34.0 11537966 836. Life expectancy less than 50
## 5 Afghanistan Asia 1972 36.1 13079460 740. Life expectancy less than 50
## 6 Afghanistan Asia 1977 38.4 14880372 786. Life expectancy less than 50
## 7 Afghanistan Asia 1982 39.9 12881816 978. Life expectancy less than 50
## 8 Afghanistan Asia 1987 40.8 13867957 852. Life expectancy less than 50
## 9 Afghanistan Asia 1992 41.7 16317921 649. Life expectancy less than 50
## 10 Afghanistan Asia 1997 41.8 22227415 635. Life expectancy less than 50
## # ... with 1,694 more rows
# count how many rows fall into each category
lifeExp_ifelse %>%
  count(over50)
## # A tibble: 2 x 2
## over50 n
## <chr> <int>
## 1 Life expectancy greater than 50 1213
## 2 Life expectancy less than 50 491
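The same distinction can be sketched without any packages. The snippet below uses base R's ifelse() rather than dplyr's if_else() (a deliberate swap, so it runs on a plain R installation), and the life expectancy values are hypothetical:

```r
# a small hypothetical vector of life expectancies
lifeExp <- c(28.8, 55.2, 71.3)

# ifelse() makes one comparison per element and returns a vector of the same length
over50 <- ifelse(lifeExp > 50,
                 "Life expectancy greater than 50",
                 "Life expectancy less than 50")
over50

# a missing value in the condition stays missing in the result
ifelse(c(60, NA) > 50, "greater", "less")
```

Unlike ifelse(), if_else() also checks that both outcomes have the same type, which is one reason to prefer it inside mutate().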
If you want to save a function to be used at another time, you can either create a package, or you can save your function in an R script file. As an example, we can save the PV() function from the previous section as an R script file named PV.R. Then, provided the script file is in the same project or working directory, we can load the code for our function by typing source("PV.R").
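A minimal sketch of this workflow, assuming a temporary directory stands in for the project directory and the script is written from R rather than saved by hand in an editor:

```r
# write the function definition to a script file named PV.R
script_path <- file.path(tempdir(), "PV.R")
writeLines(c(
  "PV <- function(FV, r, n) {",
  "  round(FV / (1 + r)^n, 2)",
  "}"
), script_path)

# source() runs the script, which makes PV() available in the session
source(script_path)
PV(FV = 1000, r = 0.08, n = 5)
```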
Iteration automates a multi-step process by organizing sequences of R commands. R has several iteration control techniques that allow you to perform repetitive tasks with different goals. These techniques are:
while loop, to iterate until a logical statement returns FALSE
for loop, to iterate over a fixed number of iterations
repeat loop, to execute until told to break
break/next statements, to exit a loop and to skip iterations in a loop
while loop
A while loop is a simple thing. The basic format of the loop looks like this:
while (condition ) {
statement_1
statement_2
ETC
}
The code corresponding to condition needs to produce a logical value, either TRUE or FALSE. Whenever R encounters a while statement, it checks to see if the condition is TRUE. If it is, then R goes on to execute all of the commands inside the curly brackets, proceeding from top to bottom as usual. However, when it gets to the bottom of those statements, it moves back up to the while statement and checks the condition again. If it is still TRUE, R executes the statements once more, and it continues in this way until at some point the condition turns out to be FALSE. Once that happens, R jumps to the bottom of the loop (i.e., to the } character), and then continues on with whatever commands appear next in the script.
Suppose you get a loan of 20,000 at an interest rate of 5% and agree to pay to the bank 375 every month over the next 5 years, or 60 months. Will you be able to repay your loan? We could simulate the whole process with R to tell us what is going to happen.
# Specify the variables that define the problem
month <- 0 # count the number of months
balance <- 20000 # initial loan balance
payments <- 375 # amount of monthly payment
interest <- 0.05 # 5% interest rate per year
total_paid <- 0 # track what you have paid the bank
# convert annual interest to a monthly multiplier. Since we have compound interest, it's not just 5%/12, but rather
# the monthly rate x is the value that makes (1+x)^12 = 1.05
monthly_multiplier <- (1 + interest) ^ (1/12)
# keep looping until the loan is paid off...
while(balance > 0){
# do the calculations for this month
month <- month + 1 # one more month
balance <- balance * monthly_multiplier # add the interest
balance <- balance - payments # make the payments
total_paid <- total_paid + payments # track the total paid
# print the results on screen
cat("month", month, ": balance", round(balance), "\n")
}
# print the total payments at the end
cat("total payments made", total_paid, "\n")
When we run this code, R checks the condition that balance > 0. Since the starting value of balance is 20000, the condition is TRUE, so it enters the body of the loop (inside the curly braces). The commands here instruct R to increase the value of month by 1, and to do all the book-keeping necessary for the loan, namely take the starting balance, add the interest, subtract the payment, calculate the new balance and keep track of the total amount paid. R then returns to the top of the loop, and rechecks the condition that balance > 0.
for loop
The for loop is also pretty simple, though not quite as simple as the while loop. The basic format of this loop goes like this:
for ( var in vector ) {
statement_1
statement_2
ETC
}
In a for loop, R runs a fixed number of iterations. We have a vector which has several elements, each one corresponding to a possible value of the variable var. In the first iteration of the loop, var is given a value corresponding to the first element of vector; in the second iteration of the loop var gets a value corresponding to the second value in vector; and so on. Once we have exhausted all of the values in the vector, the loop terminates and the flow of the program continues.
Let’s say we want to print out the first ten letters of the English alphabet. We want to start with the first letter, and then the second one, etc.; in other words, what we want to do is print(letters[i]) for every i within the range spanned by 1:10, where letters is R’s built-in vector of lowercase letters. Because we have a fixed range of values that we want to loop over, this situation is well-suited to a for loop. Here’s the code:
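The listing is not reproduced here; a loop consistent with the description and with the output below would be:

```r
# letters is a built-in character vector holding "a" to "z"
for (i in 1:10) {
  print(letters[i])
}
```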
## [1] "a"
## [1] "b"
## [1] "c"
## [1] "d"
## [1] "e"
## [1] "f"
## [1] "g"
## [1] "h"
## [1] "i"
## [1] "j"
The intuition here is that R starts by setting i to 1. It then takes letters and prints its first value, namely a, then moves back to the top of the loop. When it gets there, it increases i by 1, and then repeats the calculation. It keeps doing this until i reaches 10 and then it stops.
for loop
Let us create a dataframe with three variables (columns) that each contain 100 random numbers drawn from the standard Normal distribution \(N(0,1)\), with mean = 0 and standard deviation = 1.
We can compute the median for each column, using base R
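The code for these two steps is not shown. In the sketch below the column names a, b and c and the seed are assumptions, so the medians it prints will differ from the output that follows:

```r
set.seed(42)  # arbitrary seed, only so that the sketch is reproducible

# three columns of 100 draws each from N(0, 1); column names are hypothetical
df <- data.frame(
  a = rnorm(100),
  b = rnorm(100),
  c = rnorm(100)
)

# one median() call per column
median(df$a)
median(df$b)
median(df$c)
```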
## [1] -0.0333
## [1] -0.104
## [1] -0.15
We’ve copied-and-pasted median() three times, but we could instead use a for loop:
output <- vector(mode = "double", length = ncol(df))
for (i in seq_along(df)) {
output[[i]] <- median(df[[i]])
}
output
## [1] -0.0333 -0.1040 -0.1498
To edit or add columns to a dataframe, it’s easier to use the mutate function from the dplyr package. Suppose we want to create a new column GDP in the gapminder dataframe, expressed in billions of US$. Given that gapminder has data on the population pop and GDP per capita gdpPercap, instead of creating a for loop, we can get a column with the total GDP as shown below
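The call is not shown. A sketch that reproduces the GDP column in the output below, assuming the dplyr and gapminder packages are installed:

```r
library(dplyr)
library(gapminder)

# total GDP in billions of US$: population times GDP per capita, divided by 1e9
gapminder_gdp <- gapminder %>%
  mutate(GDP = pop * gdpPercap / 1e9)
gapminder_gdp
```

The name gapminder_gdp is hypothetical; the result is stored only so it can be inspected afterwards.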
## # A tibble: 1,704 x 7
## country continent year lifeExp pop gdpPercap GDP
## <fct> <fct> <int> <dbl> <int> <dbl> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779. 6.57
## 2 Afghanistan Asia 1957 30.3 9240934 821. 7.59
## 3 Afghanistan Asia 1962 32.0 10267083 853. 8.76
## 4 Afghanistan Asia 1967 34.0 11537966 836. 9.65
## 5 Afghanistan Asia 1972 36.1 13079460 740. 9.68
## 6 Afghanistan Asia 1977 38.4 14880372 786. 11.7
## 7 Afghanistan Asia 1982 39.9 12881816 978. 12.6
## 8 Afghanistan Asia 1987 40.8 13867957 852. 11.8
## 9 Afghanistan Asia 1992 41.7 16317921 649. 10.6
## 10 Afghanistan Asia 1997 41.8 22227415 635. 14.1
## # ... with 1,694 more rows
Frequently when working with data frames you may wish to apply a specific function to multiple columns, for instance to calculate the average value of each column in mtcars. If we wanted to calculate the average of a single column, it would be pretty simple using just the dplyr::summarise function:
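The call is not shown; one that matches the output below and the longer version that follows it would be:

```r
library(dplyr)

# mean of a single column of the built-in mtcars data
mpg_mean <- mtcars %>%
  summarise(mpg = mean(mpg))
mpg_mean
```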
## mpg
## 1 20.1
If we want to calculate the mean for all variables, we’d have to duplicate this code many times:
mtcars %>%
summarise(mpg = mean(mpg),
cyl = mean(cyl),
disp = mean(disp),
hp = mean(hp),
drat = mean(drat),
wt = mean(wt),
qsec = mean(qsec),
vs = mean(vs),
am = mean(am),
gear = mean(gear),
carb = mean(carb))
## mpg cyl disp hp drat wt qsec vs am gear carb
## 1 20.1 6.19 231 147 3.6 3.22 17.8 0.438 0.406 3.69 2.81
This is very repetitive (and boring!), but more importantly it is prone to mistakes. We can use loops and map() functions to find means for all variables, but we can also do it using scoped verbs.
Scoped verbs allow you to use standard verbs (or functions) in dplyr that affect multiple variables at once, combining both elements of repetition and (in some cases) conditional expressions. (In dplyr 1.0.0 and later the scoped verbs are superseded by across(), although they remain available.)
_if allows you to pick variables based on a predicate function like is.numeric() or is.character()
_at allows you to pick variables using the same syntax as select()
_all operates on all variables
summarise_all()
summarise_all() takes a dataframe and a function and applies that function to each column:
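The call is not shown; a sketch consistent with the output below:

```r
library(dplyr)

# apply mean() to every column of mtcars
all_means <- summarise_all(mtcars, .funs = mean)
all_means
```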
## mpg cyl disp hp drat wt qsec vs am gear carb
## 1 20.1 6.19 231 147 3.6 3.22 17.8 0.438 0.406 3.69 2.81
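If you want to apply multiple summaries, use the funs() helper (deprecated since dplyr 0.8.0 in favour of a named list of functions, such as list(min = min, max = max)). In this case, we want to get the min/max values, the mean and the standard deviation (sd) for all variables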
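The call is not shown. The sketch below uses a named list of functions instead of funs(), which is an assumption about how one would write it with a current dplyr; the list names become the column suffixes:

```r
library(dplyr)

# min, max, mean and sd of every column; names in the list become suffixes
all_stats <- summarise_all(mtcars,
                           .funs = list(min = min, max = max, mean = mean, sd = sd))
all_stats
```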
## mpg_min cyl_min disp_min hp_min drat_min wt_min qsec_min vs_min am_min gear_min carb_min mpg_max cyl_max disp_max hp_max drat_max
## 1 10.4 4 71.1 52 2.76 1.51 14.5 0 0 3 1 33.9 8 472 335 4.93
## wt_max qsec_max vs_max am_max gear_max carb_max mpg_mean cyl_mean disp_mean hp_mean drat_mean wt_mean qsec_mean vs_mean am_mean
## 1 5.42 22.9 1 1 5 8 20.1 6.19 231 147 3.6 3.22 17.8 0.438 0.406
## gear_mean carb_mean mpg_sd cyl_sd disp_sd hp_sd drat_sd wt_sd qsec_sd vs_sd am_sd gear_sd carb_sd
## 1 3.69 2.81 6.03 1.79 124 68.6 0.535 0.978 1.79 0.504 0.499 0.738 1.62
You can combine this with group_by() to calculate group-level summary statistics:
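The call is not shown; judging by the output below, the grouping variable was gear, so a consistent sketch is:

```r
library(dplyr)

# mean of every remaining column, computed separately for each value of gear
means_by_gear <- mtcars %>%
  group_by(gear) %>%
  summarise_all(.funs = mean)
means_by_gear
```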
## # A tibble: 3 x 11
## gear mpg cyl disp hp drat wt qsec vs am carb
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 3 16.1 7.47 326. 176. 3.13 3.89 17.7 0.2 0 2.67
## 2 4 24.5 4.67 123. 89.5 4.04 2.62 19.0 0.833 0.667 2.33
## 3 5 21.4 6 202. 196. 3.92 2.63 15.6 0.2 1 4.4
summarise_at()
summarise_at() allows you to pick columns in the same way as select(), that is, based on their names. There is one small difference: you need to wrap the complete selection with the vars() helper (this avoids ambiguity). In the following example, we calculate the mean for all variables, but exclude mpg.
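The call is not shown; a sketch consistent with the description and the output below:

```r
library(dplyr)

# mean of every column except mpg
means_no_mpg <- summarise_at(mtcars, .vars = vars(-mpg), .funs = mean)
means_no_mpg
```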
## cyl disp hp drat wt qsec vs am gear carb
## 1 6.19 231 147 3.6 3.22 17.8 0.438 0.406 3.69 2.81
By default, the newly created columns have the shortest names needed to uniquely identify the output.
# calculate min/max for just the mpg variable
summarise_at(mtcars, .vars = vars(mpg), .funs = funs(min, max))
## min max
## 1 10.4 33.9
# calculate min for just the mpg and wt variables
summarise_at(mtcars, .vars = vars(mpg, wt), .funs = min)
## mpg wt
## 1 10.4 1.51
# calculate min/max for all but the mpg variable
summarise_at(mtcars, .vars = vars(-mpg), .funs = funs(min, max))
## cyl_min disp_min hp_min drat_min wt_min qsec_min vs_min am_min gear_min carb_min cyl_max disp_max hp_max drat_max wt_max qsec_max
## 1 4 71.1 52 2.76 1.51 14.5 0 0 3 1 8 472 335 4.93 5.42 22.9
## vs_max am_max gear_max carb_max
## 1 1 1 5 8
summarise_if()
summarise_at() allows you to pick variables to summarize based on their name, whereas summarise_if() allows you to pick variables to summarize based on some property of the column. Typically this is their type because you want to (e.g.) apply a numeric summary function only to numeric columns:
# group gapminder by country and calculate the mean for numeric variables
gapminder %>%
group_by(country) %>%
summarise_if(.predicate = is.numeric, .funs = mean, na.rm = TRUE)
## # A tibble: 142 x 5
## country year lifeExp pop gdpPercap
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanistan 1980. 37.5 15823715. 803.
## 2 Albania 1980. 68.4 2580249. 3255.
## 3 Algeria 1980. 59.0 19875406. 4426.
## 4 Angola 1980. 37.9 7309390. 3607.
## 5 Argentina 1980. 69.1 28602240. 8956.
## 6 Australia 1980. 74.7 14649312. 19981.
## 7 Austria 1980. 73.1 7583298. 20412.
## 8 Bahrain 1980. 65.6 373913. 18078.
## 9 Bangladesh 1980. 49.8 90755395. 818.
## 10 Belgium 1980. 73.6 9725119. 19901.
## # ... with 132 more rows
# group gapminder by continent and calculate min, max, and mean for numeric variables
gapminder %>%
group_by(continent) %>%
summarise_if(.predicate = is.numeric, .funs = funs(min, max, mean), na.rm = TRUE)
## # A tibble: 5 x 13
## continent year_min lifeExp_min pop_min gdpPercap_min year_max lifeExp_max pop_max gdpPercap_max year_mean lifeExp_mean pop_mean
## <fct> <int> <dbl> <int> <dbl> <int> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
## 1 Africa 1952 23.6 60011 241. 2007 76.4 1.35e8 21951. 1980. 48.9 9.92e6
## 2 Americas 1952 37.6 662850 1202. 2007 80.7 3.01e8 42952. 1980. 64.7 2.45e7
## 3 Asia 1952 28.8 120447 331 2007 82.6 1.32e9 113523. 1980. 60.1 7.70e7
## 4 Europe 1952 43.6 147962 974. 2007 81.8 8.24e7 49357. 1980. 71.9 1.72e7
## 5 Oceania 1952 69.1 1994794 10040. 2007 81.2 2.04e7 34435. 1980. 74.3 8.87e6
## # ... with 1 more variable: gdpPercap_mean <dbl>
(Note that na.rm = TRUE is passed on to mean() in the same way as in purrr::map().)
mutate_all(), mutate_if() and mutate_at() work in a similar way to their summarise equivalents.
As an example, say we had a dataframe with various columns in 2 groups: the first group contains weights weight1, weight2, weight3, etc., and the second group contains the actual data data1 and data2.
If we wanted to cross multiply weights * data, rather than writing out for loops, we can use mutate_at to create any combination of columns of the two groups.
df <- tibble(
# weights: 100 values, from a uniform distribution between 0 and 1
weight1 = runif(100, min = 0, max = 1),
weight2 = runif(100, min = 0, max = 1),
weight3 = runif(100, min = 0, max = 1),
weight4 = runif(100, min = 0, max = 1),
# data 1: 100 values, from a normal distribution with mean = 100 and sd=20
data1 = rnorm(100, mean = 100, sd = 20),
# data 2: 100 values, from a normal distribution with mean = 250 and sd=90
data2 = rnorm(100, mean = 250, sd = 90)
)
df <- df %>%
mutate_at(. , vars(starts_with("weight")), funs(prod1 = . * data1, prod2 = . * data2))
filter_all() is the most useful of the three filter() variants. You use it in conjunction with all_vars() or any_vars() depending on whether you want rows where all variables meet the criterion, or where at least one variable meets it.
It’s particularly useful for finding missing values:
library(nycflights13)
# Rows where any value is missing
filter_all(weather, .vars_predicate = any_vars(is.na(.)))
## # A tibble: 21,135 x 15
## origin year month day hour temp dewp humid wind_dir wind_speed wind_gust precip pressure visib time_hour
## <chr> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dttm>
## 1 EWR 2013 1 1 1 39.0 26.1 59.4 270 10.4 NA 0 1012 10 2013-01-01 01:00:00
## 2 EWR 2013 1 1 2 39.0 27.0 61.6 250 8.06 NA 0 1012. 10 2013-01-01 02:00:00
## 3 EWR 2013 1 1 3 39.0 28.0 64.4 240 11.5 NA 0 1012. 10 2013-01-01 03:00:00
## 4 EWR 2013 1 1 4 39.9 28.0 62.2 250 12.7 NA 0 1012. 10 2013-01-01 04:00:00
## 5 EWR 2013 1 1 5 39.0 28.0 64.4 260 12.7 NA 0 1012. 10 2013-01-01 05:00:00
## 6 EWR 2013 1 1 6 37.9 28.0 67.2 240 11.5 NA 0 1012. 10 2013-01-01 06:00:00
## 7 EWR 2013 1 1 7 39.0 28.0 64.4 240 15.0 NA 0 1012. 10 2013-01-01 07:00:00
## 8 EWR 2013 1 1 8 39.9 28.0 62.2 250 10.4 NA 0 1012. 10 2013-01-01 08:00:00
## 9 EWR 2013 1 1 9 39.9 28.0 62.2 260 15.0 NA 0 1013. 10 2013-01-01 09:00:00
## 10 EWR 2013 1 1 10 41 28.0 59.6 260 13.8 NA 0 1012. 10 2013-01-01 10:00:00
## # ... with 21,125 more rows
# Rows where all wind variables are missing
filter_at(weather, .vars = vars(starts_with("wind")),
.vars_predicate = all_vars(is.na(.)))
## # A tibble: 4 x 15
## origin year month day hour temp dewp humid wind_dir wind_speed wind_gust precip pressure visib time_hour
## <chr> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dttm>
## 1 EWR 2013 3 27 17 52.0 19.0 27.0 NA NA NA 0 1012. 10 2013-03-27 17:00:00
## 2 JFK 2013 5 22 10 62.1 59 93.8 NA NA NA 0 NA 2.5 2013-05-22 10:00:00
## 3 JFK 2013 7 4 6 73.0 71.1 93.5 NA NA NA 0 1024. 6 2013-07-04 06:00:00
## 4 JFK 2013 7 20 6 81.0 71.1 71.9 NA NA NA 0 1010. 10 2013-07-20 06:00:00
This page last updated on: 2020-07-14
Notice that it also preserves missing values in the new column. Remember, any operation performed on a missing value will itself become a missing value.