One of the things that I found strange when I started working with R was that, unlike other software like Excel, Stata, SPSS, etc., you couldn’t just double click on an .xls, .dta, or .sav file, load the data and look at its contents. In R, we must use a command to explicitly import the data into memory.
While there are many possible data formats, we will concentrate on CSV files, namely Comma Separated Values files that are a common way to save the raw data from spreadsheets, without any of the formatting, etc. The readr R package contains functions for importing data saved as flat file documents; readr
is a core member of the tidyverse and is loaded everytime you call library(tidyverse)
.
CSV file names end with a .csv and if you opened one inside Excel, it would look like a regular Excel file. NASA provides an estimate of global surface temperature change which allows us to calculate weather anomalies. The data is available at https://data.giss.nasa.gov/gistemp/tabledata_v3/NH.Ts+dSST.csv as a CSV file which i you open inside Excel looks something like this:
However, this is what a CSV file looks like on the inside: a bunch of values separated with commas.
By the way, if you look at the data closely, you will notice that the values in the D-N
(December-November) and DJF
(December-January-February) columns for the year 1880 are ***
. These ***
denote a missing value, in the same way that R uses the NA
(or not available) value.
If you’d like R to treat these ***
values as missing, you will need to convert them to NA
s. One way to do this is to ask read_csv()
to parse ***
values as NA
values when it reads in the data.
read_csv()
Importing CSV is part of base R using the read.csv()
command. However, we will use the readr
package and its read_csv()
command that allows us to read flat data. read_csv()
is significantly (8-10 times) faster and more user friendly than the base R command, with no need to define rownames, no stringsAsFactors = TRUE
.
Even though we only concentrate on CSV files, readr
has several functions that allow you to import a specific flat file format.
Function | Reads |
---|---|
read_csv() |
Comma separated values |
read_csv2() |
Semi-colon separate values |
read_delim() |
General delimited files |
read_fwf() |
Fixed width files |
read_log() |
Apache log files |
read_table() |
Space separated files |
read_tsv() |
Tab delimited values |
Just as you can import data, readr
allows you to export data and save it locally. These functions are similar to the read_
functions and each save a tibble (or data frame) in the specific file format.
Function | Writes |
---|---|
write_csv() |
Comma separated values |
write_excel_csv() |
CSV that you plan to open in Excel |
write_delim() |
General delimited files |
write_file() |
A single string, written as is |
write_lines() |
A vector of strings, one string per line |
write_tsv() |
Tab delimited values |
To use a write_
function, first give it the name of the data frame to save, then give it a filename to save in your working directory.
If the CSV exists on the internet and you have the URL address, you don’t have to download it to your local machine and then import it; you can import it directly off the web using the URL link.
url <- "https://data.giss.nasa.gov/gistemp/tabledata_v3/NH.Ts+dSST.csv"
weather <- read_csv(url, skip = 1, na = "***")
When using the read_csv()
function, we added two options:
skip = 1
option is there as the real data table only starts in Row 2, so we need to skip one row.na = "***"
option informs R how missing observations in the spreadsheet are coded. As discussed earlier, it is best to specify NA values here, as otherwise some of the data may not be recognized as numeric data.Notice that the code above saves the output to an object named weather
. You must save the output of read_csv()
to an object if you wish to use it later; otherwise, read_csv()
will just print the contents of the data set at the command line.
Also, the assignment statement doesn’t produce any output for weather
because assignments don’t display anything. If we want to check that our data has been loaded, we can see glimpse the structure of the dataframe using glimpse()
or see its contents by just typing its name: weather
.
glimpse(weather)
shows us the number of observations and variables, and then, for each variable, shows the variable type; in our case, all of the variables are dbl
or double, namely numeric variables.weather
, just invoking the name of the dataframe, shows the contents of the dataframe in the tabular form it is saved in## Rows: 140
## Columns: 19
## $ Year <dbl> 1880, 1881, 1882, 1883, 1884, 1885, 1886, 1887, 1888, 1889, 1890, 1891, 1892, 1893, 1894, 1895, 1896, 1897, 1898, 189...
## $ Jan <dbl> -0.54, -0.19, 0.22, -0.59, -0.23, -1.00, -0.68, -1.07, -0.53, -0.31, -0.65, -0.59, -0.45, -1.47, -0.87, -0.73, -0.44,...
## $ Feb <dbl> -0.38, -0.25, 0.22, -0.67, -0.11, -0.37, -0.68, -0.58, -0.59, 0.35, -0.56, -0.68, -0.16, -0.89, -0.43, -0.77, -0.21, ...
## $ Mar <dbl> -0.26, 0.02, 0.00, -0.16, -0.65, -0.21, -0.57, -0.36, -0.58, 0.07, -0.44, -0.12, -0.56, -0.28, -0.21, -0.39, -0.49, -...
## $ Apr <dbl> -0.37, -0.02, -0.36, -0.27, -0.62, -0.53, -0.34, -0.42, -0.24, 0.15, -0.36, -0.32, -0.48, -0.30, -0.59, -0.23, -0.60,...
## $ May <dbl> -0.11, -0.06, -0.32, -0.32, -0.42, -0.55, -0.34, -0.27, -0.16, -0.05, -0.48, -0.21, -0.28, -0.44, -0.33, -0.30, -0.22...
## $ Jun <dbl> -0.22, -0.36, -0.38, -0.26, -0.52, -0.47, -0.43, -0.20, -0.04, -0.12, -0.20, -0.14, -0.11, -0.22, -0.48, -0.26, 0.01,...
## $ Jul <dbl> -0.23, -0.06, -0.37, -0.09, -0.48, -0.39, -0.20, -0.23, 0.04, -0.10, -0.24, -0.12, -0.38, -0.13, -0.24, -0.18, -0.02,...
## $ Aug <dbl> -0.24, -0.03, -0.14, -0.26, -0.50, -0.44, -0.47, -0.52, -0.19, -0.16, -0.38, -0.09, -0.28, -0.33, -0.22, -0.24, -0.07...
## $ Sep <dbl> -0.26, -0.23, -0.17, -0.33, -0.45, -0.32, -0.34, -0.17, -0.12, -0.26, -0.40, -0.03, -0.13, -0.22, -0.26, 0.02, -0.05,...
## $ Oct <dbl> -0.32, -0.40, -0.53, -0.21, -0.41, -0.30, -0.31, -0.40, 0.04, -0.34, -0.07, -0.26, -0.22, -0.23, -0.24, -0.04, 0.14, ...
## $ Nov <dbl> -0.37, -0.42, -0.32, -0.40, -0.48, -0.28, -0.45, -0.19, -0.03, -0.61, -0.69, -0.56, -0.81, -0.26, -0.47, -0.06, -0.25...
## $ Dec <dbl> -0.48, -0.28, -0.42, -0.25, -0.40, 0.00, -0.17, -0.43, -0.26, -0.55, -0.43, -0.02, -0.65, -0.63, -0.31, -0.24, -0.18,...
## $ `J-D` <dbl> -0.32, -0.19, -0.21, -0.32, -0.44, -0.40, -0.42, -0.40, -0.22, -0.16, -0.41, -0.26, -0.37, -0.45, -0.39, -0.29, -0.20...
## $ `D-N` <dbl> NA, -0.21, -0.20, -0.33, -0.43, -0.44, -0.40, -0.38, -0.24, -0.14, -0.42, -0.30, -0.32, -0.45, -0.41, -0.29, -0.20, -...
## $ DJF <dbl> NA, -0.31, 0.06, -0.56, -0.20, -0.59, -0.45, -0.61, -0.52, -0.07, -0.59, -0.57, -0.21, -1.00, -0.64, -0.60, -0.30, -0...
## $ MAM <dbl> -0.24, -0.02, -0.22, -0.25, -0.56, -0.43, -0.42, -0.35, -0.33, 0.06, -0.43, -0.22, -0.44, -0.34, -0.38, -0.31, -0.44,...
## $ JJA <dbl> -0.23, -0.15, -0.30, -0.20, -0.50, -0.44, -0.37, -0.32, -0.06, -0.13, -0.27, -0.12, -0.26, -0.23, -0.31, -0.23, -0.03...
## $ SON <dbl> -0.32, -0.35, -0.34, -0.32, -0.44, -0.30, -0.37, -0.25, -0.04, -0.40, -0.39, -0.28, -0.39, -0.24, -0.32, -0.03, -0.05...
## # A tibble: 140 x 19
## Year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec `J-D` `D-N` DJF MAM JJA SON
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1880 -0.54 -0.38 -0.26 -0.37 -0.11 -0.22 -0.23 -0.24 -0.26 -0.32 -0.37 -0.48 -0.32 NA NA -0.24 -0.23 -0.32
## 2 1881 -0.19 -0.25 0.02 -0.02 -0.06 -0.36 -0.06 -0.03 -0.23 -0.4 -0.42 -0.28 -0.19 -0.21 -0.31 -0.02 -0.15 -0.35
## 3 1882 0.22 0.22 0 -0.36 -0.32 -0.38 -0.37 -0.14 -0.17 -0.53 -0.32 -0.42 -0.21 -0.2 0.06 -0.22 -0.3 -0.34
## 4 1883 -0.59 -0.67 -0.16 -0.27 -0.32 -0.26 -0.09 -0.26 -0.33 -0.21 -0.4 -0.25 -0.32 -0.33 -0.56 -0.25 -0.2 -0.32
## 5 1884 -0.23 -0.11 -0.65 -0.62 -0.42 -0.52 -0.48 -0.5 -0.45 -0.41 -0.48 -0.4 -0.44 -0.43 -0.2 -0.56 -0.5 -0.44
## 6 1885 -1 -0.37 -0.21 -0.53 -0.55 -0.47 -0.39 -0.44 -0.32 -0.3 -0.28 0 -0.4 -0.44 -0.59 -0.43 -0.44 -0.3
## 7 1886 -0.68 -0.68 -0.570 -0.34 -0.34 -0.43 -0.2 -0.47 -0.34 -0.31 -0.45 -0.17 -0.42 -0.4 -0.45 -0.42 -0.37 -0.37
## 8 1887 -1.07 -0.580 -0.36 -0.42 -0.27 -0.2 -0.23 -0.52 -0.17 -0.4 -0.19 -0.43 -0.4 -0.38 -0.61 -0.35 -0.32 -0.25
## 9 1888 -0.53 -0.59 -0.580 -0.24 -0.16 -0.04 0.04 -0.19 -0.12 0.04 -0.03 -0.26 -0.22 -0.24 -0.52 -0.33 -0.06 -0.04
## 10 1889 -0.31 0.35 0.07 0.15 -0.05 -0.12 -0.1 -0.16 -0.26 -0.34 -0.61 -0.55 -0.16 -0.14 -0.07 0.06 -0.13 -0.4
## # ... with 130 more rows
When you use read_csv()
, read_csv()
tries to match each column of input to one of the basic data types in R. In our case, read_csv()
misidentified the contents of the Year
column to a real number, rather than an integer. You can correct this with R’s as.integer()
function, or you can read the data in again, this time instructing read_csv()
to parse the column as integers.
To do this:
col_types
to read.csv()
col_types
arguent to a list.To complete the code, set Year
equal to one of the functions below, each function instructs read_csv()
to parse Year
as a specific type of data.
Type function | Data Type |
---|---|
col_character() |
character |
col_date() |
Date |
col_datetime() |
POSIXct (date-time) |
col_double() |
double (numeric) |
col_factor() |
factor |
col_guess() |
let readr geuss (default) |
col_integer() |
integer |
col_logical() |
logical |
col_number() |
numbers mixed with non-number characters |
col_numeric() |
double or integer |
col_skip() |
do not read this column |
col_time() |
time |
If you want to read a CSV file you have saved locally in your computer, you must let RStudio know which folder the file lives in; in more technical terms, you have to set the Working Directory. You can determine the location of your working directory by running getwd()
. You can change the location of your working directory by going to Session > Set Working Directory in the RStudio IDE menu bar or use the Ctrl+Shift+H
shortcut in Windows, Cmd+Shift+H
in Mac to browse for the folder where the file resides.
janitor
package for cleaning variable namesWhen we create data files, we frequently use variable names and formats that are easily readable for humans, but no so for computers.
Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets. – For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights The New York Times, 2014
janitor
has many functions, but its core function is clean_names()
which will make your life easier if you call it whenever you load data into R. The following example is taken from janitor’s documentation page
Let us read an Excel file with a roster of teachers at a fictional American high school, stored in the Microsoft Excel file dirty_data.xlsx
.
Some of the variable names, e.g., First Name
, Last Name
, are not only capitalised, but also contain a space in the variable name. Let us read in the file and have a glimpse inside it.
## Rows: 13
## Columns: 11
## $ `First Name` <chr> "Jason", "Jason", "Alicia", "Ada", "Desus", "Chien-Shiung", "Chien-Shiung", NA, "James", "Hedy", "Carlo...
## $ `Last Name` <chr> "Bourne", "Bourne", "Keys", "Lovelace", "Nice", "Wu", "Wu", NA, "Joyce", "Lamarr", "Boozer", "Boozer", ...
## $ `Employee Status` <chr> "Teacher", "Teacher", "Teacher", "Teacher", "Administration", "Teacher", "Teacher", NA, "Teacher", "Tea...
## $ Subject <chr> "PE", "Drafting", "Music", NA, "Dean", "Physics", "Chemistry", NA, "English", "Science", "Basketball", ...
## $ `Hire Date` <dbl> 39690, 39690, 37118, 27515, 41431, 11037, 11037, NA, 32994, 27919, 42221, 34700, 40071
## $ `% Allocated` <dbl> 0.75, 0.25, 1.00, 1.00, 1.00, 0.50, 0.50, NA, 0.50, 0.50, NA, NA, 0.80
## $ `Full time?` <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", NA, "No", "No", "No", "No", "No"
## $ `do not edit! --->` <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
## $ Certification...9 <chr> "Physical ed", "Physical ed", "Instr. music", "PENDING", "PENDING", "Science 6-12", "Science 6-12", NA,...
## $ Certification...10 <chr> "Theater", "Theater", "Vocal music", "Computers", NA, "Physics", "Physics", NA, "English 6-12", NA, NA,...
## $ Certification...11 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
We notice that if we wanted to refer to the variable for a first name (1st in the list) or percent allocated (6th in the list), we would need to refer to them as the string First Name
and % Allocated
respectively. To avoid this, we can use clean_names
## Rows: 13
## Columns: 11
## $ first_name <chr> "Jason", "Jason", "Alicia", "Ada", "Desus", "Chien-Shiung", "Chien-Shiung", NA, "James", "Hedy", "Carlos"...
## $ last_name <chr> "Bourne", "Bourne", "Keys", "Lovelace", "Nice", "Wu", "Wu", NA, "Joyce", "Lamarr", "Boozer", "Boozer", "L...
## $ employee_status <chr> "Teacher", "Teacher", "Teacher", "Teacher", "Administration", "Teacher", "Teacher", NA, "Teacher", "Teach...
## $ subject <chr> "PE", "Drafting", "Music", NA, "Dean", "Physics", "Chemistry", NA, "English", "Science", "Basketball", NA...
## $ hire_date <dbl> 39690, 39690, 37118, 27515, 41431, 11037, 11037, NA, 32994, 27919, 42221, 34700, 40071
## $ percent_allocated <dbl> 0.75, 0.25, 1.00, 1.00, 1.00, 0.50, 0.50, NA, 0.50, 0.50, NA, NA, 0.80
## $ full_time <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", NA, "No", "No", "No", "No", "No"
## $ do_not_edit <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
## $ certification_9 <chr> "Physical ed", "Physical ed", "Instr. music", "PENDING", "PENDING", "Science 6-12", "Science 6-12", NA, N...
## $ certification_10 <chr> "Theater", "Theater", "Vocal music", "Computers", NA, "Physics", "Physics", NA, "English 6-12", NA, NA, "...
## $ certification_11 <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
Now, the variable names contain no spaces, are all lower case, and we can explicitly refer to them rather than using a string of characters– it all makes life a bit easier!
According to Phil Karlton, there are only two hard things in Computer Science: cache invalidation and naming things. It is good practice to use meaningful names for variables and data frames, use spacing, comments, etc. Both Google and Hadley Wickham have great style guides for programming in R and the janitor
package helps in creating variable names with a consistent style.
The readr
package provides efficient functions for reading and saving common flat file data formats. For other data types, consider using :
Package | Reads |
---|---|
readxl | Excel files (.xls, .xlsx) |
haven | SPSS, Stata, and SAS files |
jsonlite | json |
xml2 | xml |
httr | web API’s |
rvest | web pages (web scraping) |
DBI | databases |
sparklyr | data loaded into spark |
rio
: a swiss-army knife for data input-outputA really neat package to handle importing- exporing data is rio
whose authors call it A Swiss-Army knife for data input-output. It works by determining the data structure from the file extension, uses reasonable defaults for data import and export (e.g., ‘stringsAsFactors=FALSE’), supports web-based import (including from SSL/HTTPS). It also has a useful function, ‘convert()’, that provides a simple method for converting between file types. You can read more about rio
here
In 2012 Cecilia Giménez, an 83-year-old widow and amateur painter, attempted to restore a century-old fresco of Jesus crowned with thorns in her local church in Borja, Spain. The restoration didn’t go very well, but, surprisingly, the botched restoration of Jesus fresco miraculously saved the Spanish Town.
As a most important rule, please do not work on the raw data; it’s unlikely you will have Cecilia Giménez’s good fortune to become (in)famous for your not-so-brilliant work. Make sure you import the data in R, leave the raw data aside, and if you make any changes tidying and wrangling your data, save it using write_csv()
with a different file name.
This page last updated on: 2020-07-14