Reproducibility is the idea that data analyses, and more generally scientific claims, are published with their data and software code so that others may try to replicate the same work, get similar results, and build upon the works of others.
While this sounds obvious, it actually happens far less often than it should.
For instance, scientists at the biotechnology company Amgen were unable to replicate the majority of published pre-clinical cancer research studies: only 6 out of 53 landmark results could be reproduced. Similarly, it has been argued that the great majority of preclinical results cannot be reproduced, with the annual cost of irreproducible preclinical research estimated at USD 28 billion.
“You are always working with at least one collaborator: Future you.”
– Hadley Wickham
Suppose that your colleague sends you an Excel file with an analysis she has undertaken. The Excel file is likely to contain the raw data, but also graphs, results, etc. that were generated from the data. If you have ever received such a file, you know that it takes a long time to navigate it and understand the logic used to arrive at the results.
Data analysts who implement reproducibility in their projects can quickly and easily reproduce the original results and trace back how they were derived. Literate programming, an idea from Donald Knuth, is a technique for mixing written text, where you explain what you did and why, with chunks of code that produce your graphs, analyses, etc.
This makes documenting code easier, enables verification and replication, and allows the analyst to precisely replicate her own analysis. This is extremely important when you revisit work months later, because it’s highly likely you won’t remember how all the code and analysis fit together.
Let us change our traditional attitude to the construction of programs: Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to humans what we want the computer to do.
– Donald E. Knuth (1984), Literate Programming
Reproducibility is also key for communicating findings to others and to decision makers; it allows them to follow your logic, verify your results, assess your assumptions, and understand how your answers were formed rather than relying solely on your claimed results. In the data science framework employed in R for Data Science, reproducibility is infused throughout the entire workflow.
Your reproducibility goals should be:
Markdown is a lightweight markup language that allows you to quickly write and format text, which is then converted to different types of output.
R Markdown is one approach to ensuring reproducibility by providing a single cohesive authoring framework. It allows you to combine code, output, and analysis into a single document that is easily reproducible and can be output to many different file formats. R Markdown is just one tool for enabling reproducibility.
`rmarkdown` and `knitr` are a powerful combination of packages for literate programming, reproducible analysis, and document generation, which can:
An R Markdown file is a plain text file that uses the extension `.Rmd` and contains three (3) major components:
A YAML header surrounded by `---`s. This is the metadata of the document and it tells you how it is formed - what the title is, the author, date, output, and other control information.
Text formatted with Markdown.
Chunks of R code, each delimited by lines of three backticks.
Code chunks are interspersed with text throughout the document. To complete the document, you “Knit” or “render” the document. Most of you probably knit the document by clicking the “Knit” button in the script editor panel. You can also do this programmatically from the console by running the command `rmarkdown::render("example.Rmd")`.
When you knit the document you send your `.Rmd` file to `knitr`, a package for R that executes all the code chunks and creates a second markdown document (`.md`). That markdown document is then passed on to pandoc, a document conversion program that is independent of R. Pandoc allows users to convert back and forth between many different document formats such as HTML, \(\LaTeX\), Microsoft Word, etc. By splitting the workflow up, you can convert your R Markdown document into a wide range of output formats.
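The two stages described above can be run by hand to see what each one contributes. The following is a minimal sketch rather than a full account of what `rmarkdown::render()` does: it assumes the `knitr` and `rmarkdown` packages are installed, and the file name `example.Rmd` and its contents are made up for illustration. The pandoc step is skipped if pandoc cannot be found.

```r
library(knitr)
library(rmarkdown)

# Work in a temporary directory so no real files are touched.
work_dir <- tempfile("knit-demo")
dir.create(work_dir)
old_wd <- setwd(work_dir)

# Build a tiny, hypothetical R Markdown file. The fence string is assembled
# with strrep() only to keep literal backticks out of this listing.
fence <- strrep("`", 3)
writeLines(
  c(
    "# A small example",
    "",
    paste0(fence, "{r add-numbers}"),
    "1 + 1",
    fence
  ),
  "example.Rmd"
)

# Stage 1: knitr executes the chunks and writes a plain Markdown file.
md_file <- knit("example.Rmd", output = "example.md", quiet = TRUE)
md_lines <- readLines(md_file)

# Stage 2: pandoc converts the Markdown file to the requested format.
if (pandoc_available()) {
  pandoc_convert("example.md", to = "html", output = "example.html")
}

setwd(old_wd)
```

Opening `example.md` after stage 1 shows the chunk's code followed by its printed result, which is exactly what pandoc receives as input.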
There is a great online tutorial on the Markdown syntax. There is also a quick reference guide to Markdown built into RStudio. To access it, go to Help > Markdown Quick Reference.
Code chunks are where you store R code that will be executed. You can name a code chunk using the syntax `` ```{r name-here} ``. Naming chunks is a good practice to get into for several reasons. First, it makes navigating an R Markdown document using the drop-down code navigator in the bottom-left of the script editor easier since your chunks will have intuitive names. Second, it generates meaningful file names for any graphs created within the chunk, rather than unhelpful names such as `unnamed-chunk-1.png`. Finally, once you start caching your results (more on that below), using consistent names for chunks avoids having to repeat computationally intensive calculations.
A word of caution: If you use the same chunk name more than once, `knitr` will give you an error message and refuse to knit your Rmd document.
Code chunks can be customized to adjust the output of the chunk. Some important and useful options are:
`eval = FALSE` - prevents code from being evaluated. I use this in my notes for class when I want to show how to write a specific function but don’t need to actually use it.
`include = FALSE` - runs the code but doesn’t show the code or results in the final document. This is useful when you have setup code at the beginning of your document (loading packages, adjusting options, etc.) that may generate a lot of messages that are not really necessary to include in the final report.
`echo = FALSE` - prevents code from showing in the final output, but does show the results of the code. Use this if you are writing a paper or document for someone who cares more about the substantive results and less about the programming used to obtain them.
`message = FALSE` or `warning = FALSE` - prevents messages or warnings from appearing in the final document.
`results = 'hide'` - hides printed output.
`error = TRUE` - causes the document to continue knitting and rendering even if the code generates a fatal error. If you’re debugging your code, you might want to use this option. However, for the final version of your work, you do not want to allow errors to pass through unnoticed.
By default, every time you knit a document R starts anew and no previous results are saved.
If you have code chunks that run computationally intensive tasks, like running a `ggpairs()` correlation/scatterplot matrix on a large dataset, you might want to store these results to be more efficient and save time. If you use `cache = TRUE`, `knitr` will do exactly this. The output of the chunk will be saved to a specially named file on disk. Now, every time you knit the document, the cached results will be used instead of running the code fresh, as long as the chunk itself has not changed.
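The effect of caching can be seen by knitting the same file twice and timing each run. This is a small sketch, assuming the `knitr` package is installed; the file name `cache-demo.Rmd`, the chunk name `slow-step`, and the two-second `Sys.sleep()` standing in for an expensive computation are all invented for illustration.

```r
library(knitr)

# Work in a temporary directory; knitr writes its cache folder there.
work_dir <- tempfile("cache-demo")
dir.create(work_dir)
old_wd <- setwd(work_dir)

# A hypothetical document with one slow, cached chunk. The fence string is
# assembled with strrep() only to keep literal backticks out of this listing.
fence <- strrep("`", 3)
writeLines(
  c(
    paste0(fence, "{r slow-step, cache = TRUE}"),
    "Sys.sleep(2)  # stand-in for an expensive computation",
    "answer <- 42",
    fence
  ),
  "cache-demo.Rmd"
)

# First knit: the chunk is executed and its results are written to disk.
first_run <- system.time(knit("cache-demo.Rmd", quiet = TRUE))[["elapsed"]]

# Second knit: the chunk is unchanged, so the stored results are reused.
second_run <- system.time(knit("cache-demo.Rmd", quiet = TRUE))[["elapsed"]]

setwd(old_wd)

print(c(first = first_run, second = second_run))
```

The second run should be noticeably faster because the sleeping chunk is not executed again. Editing the chunk's code invalidates the cache, and the next knit runs it afresh.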
Rather than setting these options for each individual chunk, you can make them the default options for all chunks by using `knitr::opts_chunk$set()`. Just include this in a code chunk (typically in the first code chunk in the document). So for example, `knitr::opts_chunk$set(echo = FALSE)` hides the code by default in all code chunks. To override this new default, you can still declare `echo = TRUE` for individual chunks.
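Chunk defaults can also be set and inspected from the console, which is a quick way to check what a setup chunk would do. The sketch below assumes the `knitr` package is installed; the particular combination of options is only an example.

```r
library(knitr)

# Remember the current defaults so they can be put back afterwards.
old_echo <- opts_chunk$get("echo")
old_message <- opts_chunk$get("message")

# Set document-wide defaults; in a report this normally sits in the first chunk.
opts_chunk$set(echo = FALSE, message = FALSE)

# Look up the default that every chunk will now inherit.
echo_default <- opts_chunk$get("echo")
print(echo_default)

# Restore the previous defaults so this session is left unchanged.
opts_chunk$set(echo = old_echo, message = old_message)
```

Any option written in an individual chunk header takes precedence over these defaults for that chunk only.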
Until now, you have only run code in a specially designated chunk. However, you can also run R code in-line by using the `` `r ` `` syntax. You may want to run code inline to state the number of variables or rows in a dataset in a sentence like:
There are XXX variables and YYY observations in the `gapminder` dataset.
You can call code “inline” like this:
There are `r ncol(gapminder)` variables and `r nrow(gapminder)` observations in the `gapminder` dataset.
When knitted, this sentence becomes:
There are 6 variables and 1704 observations in the `gapminder` dataset.
What is great about this is that if your data changes, say a new version of `gapminder` comes out, you don’t need to hunt for every place where you mentioned the number of variables or observations; you just re-knit your Rmd document.
YAML (rhymes with camel), originally short for Yet Another Markup Language and now officially YAML Ain't Markup Language, is a standardized format for storing hierarchical data in a human-readable syntax. The YAML header controls how `rmarkdown` renders your `.Rmd` file. A YAML header is a section of `key: value` pairs surrounded by `---` marks.
---
title: "My report"
author: "Kostis Christodoulou"
date: 2019-06-30
output: html_document
---
The most important option is `output`, as this determines the final document format. However, there are other common options such as providing a `title` and `author` for your document and specifying the `date` of publication.
For your homework assignments, we have used `github_document` to generate a Markdown document. However, there are other document formats that are more commonly used.
`output: html_document` produces an HTML document. The nice feature of this document is that all images are embedded in the HTML file itself, so you can email just the `.html` file to someone and they will be able to open and read it.
Each output format has various options to customize the appearance of the final document. One option for HTML documents is to add a table of contents through the `toc` option. To add any option for an output format, just add it in a hierarchical format like this:
---
title: "My report"
author: "Kostis Christodoulou"
date: 2019-06-30
output:
  html_document:
    toc: true
    toc_depth: 2
---
You can explicitly set the number of levels included in the table of contents with `toc_depth` (the default is 3).
There are several options that control the visual appearance of HTML documents.
`theme` specifies the Bootstrap theme to use for the page (themes are drawn from the Bootswatch theme library). Valid themes include `"default"`, `"cerulean"`, `"journal"`, `"flatly"`, `"readable"`, `"spacelab"`, `"united"`, `"cosmo"`, `"lumen"`, `"paper"`, `"sandstone"`, `"simplex"`, and `"yeti"`.
`highlight` specifies the syntax highlighting style for code chunks. Supported styles include `"default"`, `"tango"`, `"pygments"`, `"kate"`, `"monochrome"`, `"espresso"`, `"zenburn"`, `"haddock"`, and `"textmate"`.
This website uses the R Markdown Websites format to render multiple `.Rmd` documents in a single website. It uses the `flatly` theme and `zenburn` highlighting.
You can render your document into multiple output formats (HTML, Word document, PDF, etc.) by supplying a list of formats:
output:
  html_document:
    toc: true
    toc_float: true
  pdf_document: default
  word_document:
    toc: yes
If you don’t want to change any of the default options for a format, use the `default` option. You cannot specify multiple formats like this:
output:
  html_document:
    toc: true
    toc_float: true
  pdf_document
You must assign some value to the second output format, hence the use of `default`.
When rendering multiple output formats, you cannot just click the “Knit” button. Doing so will only render the first output format listed in the YAML. To render all output formats, you need to programmatically render the document using `rmarkdown::render("my-document.Rmd", output_format = "all")`. Type `?render` in the console to look up the help file for `render()` and see the different arguments the function can accept.
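As a rough illustration of rendering every listed format in one call, the sketch below writes a small document to a temporary folder and renders it. It assumes the `rmarkdown` package and pandoc are installed; the document contents are invented, and HTML and Word are used as the two formats so that no \(\LaTeX\) installation is needed. Rendering is skipped if pandoc cannot be found.

```r
library(rmarkdown)

# Create a hypothetical document in a temporary folder.
work_dir <- tempfile("render-demo")
dir.create(work_dir)
rmd_file <- file.path(work_dir, "my-document.Rmd")

writeLines(
  c(
    "---",
    "title: \"My report\"",
    "output:",
    "  html_document:",
    "    toc: true",
    "  word_document: default",
    "---",
    "",
    "# Section",
    "",
    "Some text."
  ),
  rmd_file
)

# Render every format listed in the YAML header; render() returns the
# paths of the files it created.
outputs <- character(0)
if (pandoc_available()) {
  outputs <- render(rmd_file, output_format = "all", quiet = TRUE)
  print(basename(outputs))
}
```

Passing a single format name instead, for example `output_format = "word_document"`, renders only that format using the options given for it in the header.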
We don’t have to use R Markdown documents for all our work; in many instances, using a script might be preferable.
A script is a plain-text file with a `.R` file extension. It contains R code, and it helps to add comments using the `#` symbol to explain to yourself (and others!) what you are doing. You edit scripts in the editor panel in RStudio.
Scripts are easier to troubleshoot than R Markdown documents because your code is not split across chunks and you can run everything interactively. When you first begin a project, you may find it easier to start with scripts to build and debug your code, and then convert your work to an R Markdown document once you begin the substantive analysis and writeup. Or you may use a mix of scripts and R Markdown documents depending on the size and complexity of your project. For instance, you could use a reproducible pipeline which uses a sequence of R scripts to download, import, and transform your data, then use an R Markdown document to produce a final report.
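One way to picture such a pipeline is a small driver that runs the scripts in order and then renders the report. This is only a sketch under stated assumptions: the helper `run_pipeline()` and the script names `01-import.R` and `02-transform.R` are hypothetical, and the scripts are generated on the fly so that the example is self-contained. The report step is optional and requires the `rmarkdown` package and pandoc.

```r
# Run a sequence of R scripts in order, then optionally render a report.
# All scripts share one environment, so later scripts can use objects
# created by earlier ones.
run_pipeline <- function(scripts, report = NULL) {
  env <- new.env(parent = globalenv())
  for (script in scripts) {
    message("Running ", basename(script))
    source(script, local = env)
  }
  if (!is.null(report)) {
    # The report is rendered in the same environment as the scripts.
    rmarkdown::render(report, envir = env, quiet = TRUE)
  }
  invisible(env)
}

# Two tiny, made-up scripts standing in for real import and transform steps.
work_dir <- tempfile("pipeline-demo")
dir.create(work_dir)
import_script <- file.path(work_dir, "01-import.R")
transform_script <- file.path(work_dir, "02-transform.R")
writeLines("raw <- data.frame(x = 1:5)", import_script)
writeLines("clean <- transform(raw, y = x^2)", transform_script)

# Run the scripts in order; no report is rendered in this demonstration.
result <- run_pipeline(c(import_script, transform_script))
print(result$clean)
```

Numbering the script files keeps the intended order obvious, and because everything is driven from one function call, the whole analysis can be re-run from scratch at any time.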
If you want to learn more, the people at RStudio have produced a brilliant R Markdown tutorial.
This page last updated on: 2020-07-14