7. Import data

7.1 Overview

One of the things that I found strange when I started working with R was that, unlike other software like Excel, Stata, SPSS, etc., you couldn’t just double click on an .xls, .dta, or .sav file, load the data and look at its contents. In R, we must use a command to explicitly import the data into memory.

While there are many possible data formats, we will concentrate on CSV files, namely Comma Separated Values files that are a common way to save the raw data from spreadsheets, without any of the formatting, etc. The readr R package contains functions for importing data saved as flat file documents; readr is a core member of the tidyverse and is loaded everytime you call library(tidyverse).

CSV file names end with a .csv and if you opened one inside Excel, it would look like a regular Excel file. NASA provides an estimate of global surface temperature change which allows us to calculate weather anomalies. The data is available at https://data.giss.nasa.gov/gistemp/tabledata_v3/NH.Ts+dSST.csv as a CSV file which i you open inside Excel looks something like this:

However, this is what a CSV file looks like on the inside: a bunch of values separated with commas.

By the way, if you look at the data closely, you will notice that the values in the D-N (December-November) and DJF (December-January-February) columns for the year 1880 are ***. These *** denote a missing value, in the same way that R uses the NA (or not available) value.

If you’d like R to treat these *** values as missing, you will need to convert them to NAs. One way to do this is to ask read_csv() to parse *** values as NA values when it reads in the data.

7.2 Importing CSV files: `read_csv()`

Importing CSV is part of base R using the read.csv() command. However, we will use the readr package and its read_csv() command that allows us to read flat data. read_csv() is significantly (8-10 times) faster and more user friendly than the base R command, with no need to define rownames, no stringsAsFactors = TRUE.

Even though we only concentrate on CSV files, readr has several functions that allow you to import a specific flat file format.

Function	Reads
`read_csv()`	Comma separated values
`read_csv2()`	Semi-colon separate values
`read_delim()`	General delimited files
`read_fwf()`	Fixed width files
`read_log()`	Apache log files
`read_table()`	Space separated files
`read_tsv()`	Tab delimited values

Just as you can import data, readr allows you to export data and save it locally. These functions are similar to the read_ functions and each save a tibble (or data frame) in the specific file format.

Function	Writes
`write_csv()`	Comma separated values
`write_excel_csv()`	CSV that you plan to open in Excel
`write_delim()`	General delimited files
`write_file()`	A single string, written as is
`write_lines()`	A vector of strings, one string per line
`write_tsv()`	Tab delimited values

To use a write_ function, first give it the name of the data frame to save, then give it a filename to save in your working directory.

7.2.1 Importing CSV files directly off the internet

If the CSV exists on the internet and you have the URL address, you don’t have to download it to your local machine and then import it; you can import it directly off the web using the URL link.

url <- "https://data.giss.nasa.gov/gistemp/tabledata_v3/NH.Ts+dSST.csv"
weather <- read_csv(url, skip = 1, na = "***")

When using the read_csv() function, we added two options:

skip = 1 option is there as the real data table only starts in Row 2, so we need to skip one row.
na = "***"option informs R how missing observations in the spreadsheet are coded. As discussed earlier, it is best to specify NA values here, as otherwise some of the data may not be recognized as numeric data.

Notice that the code above saves the output to an object named weather. You must save the output of read_csv() to an object if you wish to use it later; otherwise, read_csv() will just print the contents of the data set at the command line.

Also, the assignment statement doesn’t produce any output for weather because assignments don’t display anything. If we want to check that our data has been loaded, we can see glimpse the structure of the dataframe using glimpse() or see its contents by just typing its name: weather.

glimpse(weather) shows us the number of observations and variables, and then, for each variable, shows the variable type; in our case, all of the variables are dbl or double, namely numeric variables.
weather, just invoking the name of the dataframe, shows the contents of the dataframe in the tabular form it is saved in

glimpse(weather)

## Rows: 140
## Columns: 19
## $ Year  <dbl> 1880, 1881, 1882, 1883, 1884, 1885, 1886, 1887, 1888, 1889, 1890, 1891, 1892, 1893, 1894, 1895, 1896, 1897, 1898, 189...
## $ Jan   <dbl> -0.54, -0.19, 0.22, -0.59, -0.23, -1.00, -0.68, -1.07, -0.53, -0.31, -0.65, -0.59, -0.45, -1.47, -0.87, -0.73, -0.44,...
## $ Feb   <dbl> -0.38, -0.25, 0.22, -0.67, -0.11, -0.37, -0.68, -0.58, -0.59, 0.35, -0.56, -0.68, -0.16, -0.89, -0.43, -0.77, -0.21, ...
## $ Mar   <dbl> -0.26, 0.02, 0.00, -0.16, -0.65, -0.21, -0.57, -0.36, -0.58, 0.07, -0.44, -0.12, -0.56, -0.28, -0.21, -0.39, -0.49, -...
## $ Apr   <dbl> -0.37, -0.02, -0.36, -0.27, -0.62, -0.53, -0.34, -0.42, -0.24, 0.15, -0.36, -0.32, -0.48, -0.30, -0.59, -0.23, -0.60,...
## $ May   <dbl> -0.11, -0.06, -0.32, -0.32, -0.42, -0.55, -0.34, -0.27, -0.16, -0.05, -0.48, -0.21, -0.28, -0.44, -0.33, -0.30, -0.22...
## $ Jun   <dbl> -0.22, -0.36, -0.38, -0.26, -0.52, -0.47, -0.43, -0.20, -0.04, -0.12, -0.20, -0.14, -0.11, -0.22, -0.48, -0.26, 0.01,...
## $ Jul   <dbl> -0.23, -0.06, -0.37, -0.09, -0.48, -0.39, -0.20, -0.23, 0.04, -0.10, -0.24, -0.12, -0.38, -0.13, -0.24, -0.18, -0.02,...
## $ Aug   <dbl> -0.24, -0.03, -0.14, -0.26, -0.50, -0.44, -0.47, -0.52, -0.19, -0.16, -0.38, -0.09, -0.28, -0.33, -0.22, -0.24, -0.07...
## $ Sep   <dbl> -0.26, -0.23, -0.17, -0.33, -0.45, -0.32, -0.34, -0.17, -0.12, -0.26, -0.40, -0.03, -0.13, -0.22, -0.26, 0.02, -0.05,...
## $ Oct   <dbl> -0.32, -0.40, -0.53, -0.21, -0.41, -0.30, -0.31, -0.40, 0.04, -0.34, -0.07, -0.26, -0.22, -0.23, -0.24, -0.04, 0.14, ...
## $ Nov   <dbl> -0.37, -0.42, -0.32, -0.40, -0.48, -0.28, -0.45, -0.19, -0.03, -0.61, -0.69, -0.56, -0.81, -0.26, -0.47, -0.06, -0.25...
## $ Dec   <dbl> -0.48, -0.28, -0.42, -0.25, -0.40, 0.00, -0.17, -0.43, -0.26, -0.55, -0.43, -0.02, -0.65, -0.63, -0.31, -0.24, -0.18,...
## $ `J-D` <dbl> -0.32, -0.19, -0.21, -0.32, -0.44, -0.40, -0.42, -0.40, -0.22, -0.16, -0.41, -0.26, -0.37, -0.45, -0.39, -0.29, -0.20...
## $ `D-N` <dbl> NA, -0.21, -0.20, -0.33, -0.43, -0.44, -0.40, -0.38, -0.24, -0.14, -0.42, -0.30, -0.32, -0.45, -0.41, -0.29, -0.20, -...
## $ DJF   <dbl> NA, -0.31, 0.06, -0.56, -0.20, -0.59, -0.45, -0.61, -0.52, -0.07, -0.59, -0.57, -0.21, -1.00, -0.64, -0.60, -0.30, -0...
## $ MAM   <dbl> -0.24, -0.02, -0.22, -0.25, -0.56, -0.43, -0.42, -0.35, -0.33, 0.06, -0.43, -0.22, -0.44, -0.34, -0.38, -0.31, -0.44,...
## $ JJA   <dbl> -0.23, -0.15, -0.30, -0.20, -0.50, -0.44, -0.37, -0.32, -0.06, -0.13, -0.27, -0.12, -0.26, -0.23, -0.31, -0.23, -0.03...
## $ SON   <dbl> -0.32, -0.35, -0.34, -0.32, -0.44, -0.30, -0.37, -0.25, -0.04, -0.40, -0.39, -0.28, -0.39, -0.24, -0.32, -0.03, -0.05...

weather

## # A tibble: 140 x 19
##     Year   Jan    Feb    Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov   Dec `J-D` `D-N`   DJF   MAM   JJA   SON
##    <dbl> <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1  1880 -0.54 -0.38  -0.26  -0.37 -0.11 -0.22 -0.23 -0.24 -0.26 -0.32 -0.37 -0.48 -0.32 NA    NA    -0.24 -0.23 -0.32
##  2  1881 -0.19 -0.25   0.02  -0.02 -0.06 -0.36 -0.06 -0.03 -0.23 -0.4  -0.42 -0.28 -0.19 -0.21 -0.31 -0.02 -0.15 -0.35
##  3  1882  0.22  0.22   0     -0.36 -0.32 -0.38 -0.37 -0.14 -0.17 -0.53 -0.32 -0.42 -0.21 -0.2   0.06 -0.22 -0.3  -0.34
##  4  1883 -0.59 -0.67  -0.16  -0.27 -0.32 -0.26 -0.09 -0.26 -0.33 -0.21 -0.4  -0.25 -0.32 -0.33 -0.56 -0.25 -0.2  -0.32
##  5  1884 -0.23 -0.11  -0.65  -0.62 -0.42 -0.52 -0.48 -0.5  -0.45 -0.41 -0.48 -0.4  -0.44 -0.43 -0.2  -0.56 -0.5  -0.44
##  6  1885 -1    -0.37  -0.21  -0.53 -0.55 -0.47 -0.39 -0.44 -0.32 -0.3  -0.28  0    -0.4  -0.44 -0.59 -0.43 -0.44 -0.3 
##  7  1886 -0.68 -0.68  -0.570 -0.34 -0.34 -0.43 -0.2  -0.47 -0.34 -0.31 -0.45 -0.17 -0.42 -0.4  -0.45 -0.42 -0.37 -0.37
##  8  1887 -1.07 -0.580 -0.36  -0.42 -0.27 -0.2  -0.23 -0.52 -0.17 -0.4  -0.19 -0.43 -0.4  -0.38 -0.61 -0.35 -0.32 -0.25
##  9  1888 -0.53 -0.59  -0.580 -0.24 -0.16 -0.04  0.04 -0.19 -0.12  0.04 -0.03 -0.26 -0.22 -0.24 -0.52 -0.33 -0.06 -0.04
## 10  1889 -0.31  0.35   0.07   0.15 -0.05 -0.12 -0.1  -0.16 -0.26 -0.34 -0.61 -0.55 -0.16 -0.14 -0.07  0.06 -0.13 -0.4 
## # ... with 130 more rows

When you use read_csv(), read_csv() tries to match each column of input to one of the basic data types in R. In our case, read_csv() misidentified the contents of the Year column to a real number, rather than an integer. You can correct this with R’s as.integer() function, or you can read the data in again, this time instructing read_csv() to parse the column as integers.

To do this:

add the argument col_types to read.csv()
set equal the col_types arguent to a list.
add a named element to the list for each column you would like to manually parse; in our case, we want to make column ‘Year’ an integer.

weather <- read_csv(url, skip = 1, na = "***",
                   col_types = list(Year = col_integer()))

To complete the code, set Year equal to one of the functions below, each function instructs read_csv() to parse Year as a specific type of data.

Type function	Data Type
`col_character()`	character
`col_date()`	Date
`col_datetime()`	POSIXct (date-time)
`col_double()`	double (numeric)
`col_factor()`	factor
`col_guess()`	let readr geuss (default)
`col_integer()`	integer
`col_logical()`	logical
`col_number()`	numbers mixed with non-number characters
`col_numeric()`	double or integer
`col_skip()`	do not read this column
`col_time()`	time

7.2.2 Importing CSV files saved locally

If you want to read a CSV file you have saved locally in your computer, you must let RStudio know which folder the file lives in; in more technical terms, you have to set the Working Directory. You can determine the location of your working directory by running getwd(). You can change the location of your working directory by going to Session > Set Working Directory in the RStudio IDE menu bar or use the Ctrl+Shift+H shortcut in Windows, Cmd+Shift+H in Mac to browse for the folder where the file resides.

7.3 `janitor` package for cleaning variable names

When we create data files, we frequently use variable names and formats that are easily readable for humans, but no so for computers.

Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets. – For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights The New York Times, 2014

janitor has many functions, but its core function is clean_names() which will make your life easier if you call it whenever you load data into R. The following example is taken from janitor’s documentation page

Let us read an Excel file with a roster of teachers at a fictional American high school, stored in the Microsoft Excel file dirty_data.xlsx.

Some of the variable names, e.g., First Name, Last Name, are not only capitalised, but also contain a space in the variable name. Let us read in the file and have a glimpse inside it.

library(readxl)
roster <- read_excel("dirty_data.xlsx")
glimpse(roster)

## Rows: 13
## Columns: 11
## $ `First Name`        <chr> "Jason", "Jason", "Alicia", "Ada", "Desus", "Chien-Shiung", "Chien-Shiung", NA, "James", "Hedy", "Carlo...
## $ `Last Name`         <chr> "Bourne", "Bourne", "Keys", "Lovelace", "Nice", "Wu", "Wu", NA, "Joyce", "Lamarr", "Boozer", "Boozer", ...
## $ `Employee Status`   <chr> "Teacher", "Teacher", "Teacher", "Teacher", "Administration", "Teacher", "Teacher", NA, "Teacher", "Tea...
## $ Subject             <chr> "PE", "Drafting", "Music", NA, "Dean", "Physics", "Chemistry", NA, "English", "Science", "Basketball", ...
## $ `Hire Date`         <dbl> 39690, 39690, 37118, 27515, 41431, 11037, 11037, NA, 32994, 27919, 42221, 34700, 40071
## $ `% Allocated`       <dbl> 0.75, 0.25, 1.00, 1.00, 1.00, 0.50, 0.50, NA, 0.50, 0.50, NA, NA, 0.80
## $ `Full time?`        <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", NA, "No", "No", "No", "No", "No"
## $ `do not edit! --->` <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
## $ Certification...9   <chr> "Physical ed", "Physical ed", "Instr. music", "PENDING", "PENDING", "Science 6-12", "Science 6-12", NA,...
## $ Certification...10  <chr> "Theater", "Theater", "Vocal music", "Computers", NA, "Physics", "Physics", NA, "English 6-12", NA, NA,...
## $ Certification...11  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA

We notice that if we wanted to refer to the variable for a first name (1st in the list) or percent allocated (6th in the list), we would need to refer to them as the string First Name and % Allocated respectively. To avoid this, we can use clean_names

roster_clean <- roster %>% 
  clean_names()

glimpse(roster_clean)

## Rows: 13
## Columns: 11
## $ first_name        <chr> "Jason", "Jason", "Alicia", "Ada", "Desus", "Chien-Shiung", "Chien-Shiung", NA, "James", "Hedy", "Carlos"...
## $ last_name         <chr> "Bourne", "Bourne", "Keys", "Lovelace", "Nice", "Wu", "Wu", NA, "Joyce", "Lamarr", "Boozer", "Boozer", "L...
## $ employee_status   <chr> "Teacher", "Teacher", "Teacher", "Teacher", "Administration", "Teacher", "Teacher", NA, "Teacher", "Teach...
## $ subject           <chr> "PE", "Drafting", "Music", NA, "Dean", "Physics", "Chemistry", NA, "English", "Science", "Basketball", NA...
## $ hire_date         <dbl> 39690, 39690, 37118, 27515, 41431, 11037, 11037, NA, 32994, 27919, 42221, 34700, 40071
## $ percent_allocated <dbl> 0.75, 0.25, 1.00, 1.00, 1.00, 0.50, 0.50, NA, 0.50, 0.50, NA, NA, 0.80
## $ full_time         <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", NA, "No", "No", "No", "No", "No"
## $ do_not_edit       <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
## $ certification_9   <chr> "Physical ed", "Physical ed", "Instr. music", "PENDING", "PENDING", "Science 6-12", "Science 6-12", NA, N...
## $ certification_10  <chr> "Theater", "Theater", "Vocal music", "Computers", NA, "Physics", "Physics", NA, "English 6-12", NA, NA, "...
## $ certification_11  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA

Now, the variable names contain no spaces, are all lower case, and we can explicitly refer to them rather than using a string of characters– it all makes life a bit easier!

7.3.1 Code that works is not necessarily good code

According to Phil Karlton, there are only two hard things in Computer Science: cache invalidation and naming things. It is good practice to use meaningful names for variables and data frames, use spacing, comments, etc. Both Google and Hadley Wickham have great style guides for programming in R and the janitor package helps in creating variable names with a consistent style.

7.4 Other data formats

The readr package provides efficient functions for reading and saving common flat file data formats. For other data types, consider using :

Package	Reads
readxl	Excel files (.xls, .xlsx)
haven	SPSS, Stata, and SAS files
jsonlite	json
xml2	xml
httr	web API’s
rvest	web pages (web scraping)
DBI	databases
sparklyr	data loaded into spark

7.4.1 `rio`: a swiss-army knife for data input-output

A really neat package to handle importing- exporing data is rio whose authors call it A Swiss-Army knife for data input-output. It works by determining the data structure from the file extension, uses reasonable defaults for data import and export (e.g., ‘stringsAsFactors=FALSE’), supports web-based import (including from SSL/HTTPS). It also has a useful function, ‘convert()’, that provides a simple method for converting between file types. You can read more about rio here

7.5 Never work directly on the raw data

In 2012 Cecilia Giménez, an 83-year-old widow and amateur painter, attempted to restore a century-old fresco of Jesus crowned with thorns in her local church in Borja, Spain. The restoration didn’t go very well, but, surprisingly, the botched restoration of Jesus fresco miraculously saved the Spanish Town.

As a most important rule, please do not work on the raw data; it’s unlikely you will have Cecilia Giménez’s good fortune to become (in)famous for your not-so-brilliant work. Make sure you import the data in R, leave the raw data aside, and if you make any changes tidying and wrangling your data, save it using write_csv() with a different file name.

7.6 Other links

For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights

This page last updated on: 2020-07-14