5.1 Reproducibility in scientific research

Reproducibility is the idea that data analyses, and more generally scientific claims, are published with their data and software code so that others can replicate the work, obtain similar results, and build on it.

While this sounds obvious, it happens far less often than it should.

For instance, scientists at the biotechnology company Amgen were unable to replicate the majority of published preclinical cancer research studies: only 6 of 53 landmark results could be reproduced. Similarly, it has been argued that the great majority of preclinical results cannot be reproduced, with the annual cost of irreproducible preclinical research estimated at 28 billion USD.

“You are always working with at least one collaborator: Future you.”
      – Hadley Wickham

Suppose that your colleague sends you an Excel file with an analysis she has undertaken. The Excel file is likely to contain the raw data, but also graphs, results, etc. that were generated from the data. If you have ever received such a file, you know that it takes a long time to navigate around it and understand the logic used to arrive at the results.

Data analysts who implement reproducibility in their projects can quickly and easily reproduce the original results and trace back how they were derived. Literate programming, an idea from Donald Knuth, is a technique for mixing written text, where you explain what you did and why, with chunks of code that produce your graphs, analyses, etc.

This makes documentation of code easier, enables verification and replication, and allows the analyst to precisely replicate her analysis. This is extremely important when you revisit work months later, because you are unlikely to remember how all the code and analysis fit together.

Let us change our traditional attitude to the construction of programs: Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to humans what we want the computer to do.
      – Donald E. Knuth (1984), Literate Programming

Reproducibility is also key for communicating findings to others and to decision makers; it allows them to follow your logic, verify your results, assess your assumptions, and understand how your answers were formed rather than relying solely on your claimed results. In the data science framework employed in R for Data Science, reproducibility is infused throughout the entire workflow.

To check whether your work meets your reproducibility goals, ask yourself:

  • Are the results (tables and figures) reproducible from the code and data?
  • Does the code actually do what you think it does?
  • Is the code well documented so someone else can follow your work?
  • In addition to what was done, is it clear why it was done? (e.g., how were parameter settings chosen?)
  • Can the code be used for other, or newer, data?
  • Can you generalise the code to do other things?

5.2 R Markdown basics

Markdown is a lightweight markup language that allows you to quickly write and format text, which is then converted to different types of output.

R Markdown is one approach to ensuring reproducibility: it provides a single, cohesive authoring framework. It allows you to combine code, output, and analysis in a single document that is easily reproducible and can be rendered to many different file formats. It is only one of several tools for enabling reproducibility.

rmarkdown and knitr are a powerful combination of packages for literate programming, reproducible analysis, and document generation, which can:

  • Combine R code and Markdown syntax
  • Produce documents in PDF, Microsoft Word, and various types of HTML
  • Incorporate “extras” like interactive graphics when the output is HTML

An R Markdown file is a plain text file that uses the extension .Rmd and contains three (3) major components:

  1. A YAML header surrounded by ---s. This is the metadata of the document and it tells you how it is formed - what the title is, the author, date, output, and other control information.
  2. Chunks of R code surrounded by ```
  3. Text mixed with simple text formatting using the Markdown syntax
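
As a rough illustration of the three components listed above, a minimal .Rmd file might look like the sketch below. The title, heading, and chunk name (cars-summary) are placeholders chosen for this example; cars is a dataset that ships with R.

````markdown
---
title: "My report"
output: html_document
---

## Summary

Text written in *Markdown* goes here, explaining what the code does and why.

```{r cars-summary}
summary(cars)
```
````

The YAML header comes first, and the text and code chunks can then alternate in any order.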

Code chunks are interspersed with text throughout the document. To complete the document, you “Knit” or “render” the document. Most of you probably knit the document by clicking the “Knit” button in the script editor panel. You can also do this programmatically from the console by running the command rmarkdown::render("example.Rmd").

When you knit the document you send your .Rmd file to knitr, a package for R that executes all the code chunks and creates a second markdown document (.md). That markdown document is then passed onto pandoc, a document rendering software program independent from R. Pandoc allows users to convert back and forth between many different document formats such as HTML, \(\LaTeX\), Microsoft Word, etc. By splitting the workflow up, you can convert your R Markdown document into a wide range of output formats.
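
The two stages described above can be run by hand, which is a useful way to see what each tool contributes. The following is a minimal sketch, assuming the knitr and rmarkdown packages are installed; the file contents and the chunk name (add) are made up for the illustration, and the pandoc step is skipped if pandoc is not found.

```r
library(knitr)

# Build a tiny, hypothetical R Markdown file in a temporary location.
# The code fence is assembled with strrep() only to keep this example tidy.
fence    <- strrep("`", 3)
rmd_file <- tempfile(fileext = ".Rmd")
md_file  <- sub("\\.Rmd$", ".md", rmd_file)
writeLines(c(
  "# Demo",
  "",
  paste0(fence, "{r add}"),
  "1 + 1",
  fence
), rmd_file)

# Stage 1: knitr executes the chunks and writes a plain Markdown file
knit(rmd_file, output = md_file, quiet = TRUE)
md_lines <- readLines(md_file)
print(md_lines)

# Stage 2: pandoc converts the Markdown file to HTML (skipped without pandoc)
html_file <- sub("\\.md$", ".html", md_file)
if (requireNamespace("rmarkdown", quietly = TRUE) && rmarkdown::pandoc_available()) {
  rmarkdown::pandoc_convert(md_file, to = "html", output = html_file)
}
```

In everyday use, rmarkdown::render() performs both stages in one call, so you rarely need to run them separately.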

5.2.1 Text formatting with Markdown

There is a great online tutorial on Markdown syntax. There is also a quick reference guide to Markdown built into RStudio. To access it, go to Help > Markdown Quick Reference.

5.2.2 Code chunks

Code chunks are where you store R code that will be executed. You can name a code chunk using the syntax ```{r name-here}. Naming chunks is a good practice to get into for several reasons. First, it makes navigating an R Markdown document using the drop-down code navigator in the bottom-left of the script editor easier since your chunks will have intuitive names. Second, it generates meaningful file names for any graphs created within the chunk, rather than unhelpful names such as unnamed-chunk-1.png. Finally, once you start caching your results (more on that below), using consistent names for chunks avoids having to repeat computationally intensive calculations.

A word of caution: If you use the same chunk name more than once, knitr will give you an error message and refuse to knit your Rmd document.
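
To see the duplicate-name check in action, the sketch below knits a throwaway document in which two chunks share the hypothetical name summary-stats, and captures the resulting error message instead of stopping. It assumes knitr is installed with its default settings.

```r
library(knitr)

# Assemble a small document with two chunks that share one name
fence    <- strrep("`", 3)
rmd_file <- tempfile(fileext = ".Rmd")
writeLines(c(
  paste0(fence, "{r summary-stats}"), "1 + 1", fence,
  "",
  paste0(fence, "{r summary-stats}"), "2 + 2", fence
), rmd_file)

# knit() should fail; tryCatch() turns the error into a message we can inspect
result <- tryCatch(
  knit(rmd_file, output = tempfile(fileext = ".md"), quiet = TRUE),
  error = function(e) conditionMessage(e)
)
print(result)
```

Renaming the second chunk (for example to summary-stats-2) is enough to make the document knit again.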

5.2.2.1 Customizing chunks

Code chunks can be customized to adjust the output of the chunk. Some important and useful options are:

  • eval = FALSE - prevents code from being evaluated. I use this in my notes for class when I want to show how to write a specific function but don’t need to actually use it.
  • include = FALSE - runs the code but doesn’t show the code or results in the final document. This is useful when you have setup code at the beginning of your document (loading packages, adjusting options, etc.) that may generate a lot of messages that are not really necessary to include in the final report.
  • echo = FALSE - prevents code from showing in the final output, but does show the results of the code. Use this if you are writing a paper or document for someone who cares more about the substantive results and less about the programming used to obtain them.
  • message = FALSE or warning = FALSE - prevents messages or warnings from appearing in the final document.
  • results = 'hide' - hides printed output.
  • error = TRUE - causes the document to continue knitting and rendering even if the code generates a fatal error. If you’re debugging your code, you might want to use this option. However, for the final version of your work, you do not want to allow errors to pass through unnoticed.
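
The options above go inside the chunk header, after the chunk name. The fragment below is an illustrative sketch only: the chunk names and the function my_function are made up, and it assumes the ggplot2 package (with its mpg dataset) is installed.

````markdown
```{r setup, include = FALSE}
# Runs, but neither the code nor its output appears in the report
library(ggplot2)
```

```{r scatterplot, echo = FALSE, message = FALSE, warning = FALSE}
# Only the plot appears; code, messages, and warnings are hidden
ggplot(mpg, aes(displ, hwy)) + geom_point()
```

```{r function-demo, eval = FALSE}
# Shown in the report, but never executed
my_function <- function(x) x^2
```
````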

5.2.2.2 Caching

By default, every time you knit a document R starts anew and no previous results are saved.

If you have code chunks that run computationally intensive tasks, such as a ggpairs() correlation/scatterplot matrix on a large dataset, you might want to store the results to save time. If you set cache = TRUE, knitr will do exactly this: the output of the chunk is saved to a specially named file on disk. Every subsequent time you knit the document, the cached results are used instead of running the code afresh, unless the chunk’s code or options have changed.
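
A cached chunk is declared like any other chunk, with cache = TRUE added to its header. The sketch below assumes the GGally package (which provides ggpairs()) is installed; the chunk name pairs-plot and the use of the built-in iris data are placeholders for a genuinely slow computation.

````markdown
```{r pairs-plot, cache = TRUE}
GGally::ggpairs(iris)
```
````

Giving the chunk a stable name matters here, since the cached files are stored under that name.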

5.2.3 Global options

Rather than setting these options for each individual chunk, you can make them the default options for all chunks by using knitr::opts_chunk$set(). Just include this in a code chunk (typically in the first code chunk in the document). So for example,

knitr::opts_chunk$set(
  echo = FALSE
)

hides the code by default in all code chunks. To override this new default, you can still declare echo = TRUE for individual chunks.
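
If you are unsure what the current defaults are, you can inspect them from the console. The short sketch below, which assumes only that knitr is installed, reads the default for echo, changes it, and then restores the original value.

```r
library(knitr)

# Read the current default (TRUE in a fresh session)
old_echo <- opts_chunk$get("echo")

# Change the default for all subsequent chunks, then confirm the change
opts_chunk$set(echo = FALSE)
new_echo <- opts_chunk$get("echo")
print(new_echo)

# Restore the original default so later work is unaffected
opts_chunk$set(echo = old_echo)
```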

5.2.4 Inline code

Until now, you have only run code in a specially designated chunk. However, you can also run R code inline by using the `r ` syntax. You may want to do this to state the number of variables or rows in a dataset in a sentence like:

There are XXX variables and YYY observations in the gapminder dataset.

You can call code “inline” like this:

There are `r ncol(gapminder)` variables and `r nrow(gapminder)` observations in the `gapminder` dataset.

When knitted, this renders as:

There are 6 variables and 1704 observations in the `gapminder` dataset.

What is great about this is that if your data changes, say a new version of gapminder comes out, then you don’t need to worry where you mentioned the number of variables/observations of data, you just re-knit your Rmd document.
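
You can try inline code without building a full report. The sketch below knits a one-sentence document and prints the result; it uses the built-in mtcars dataset as a stand-in for gapminder so that it runs without extra data packages, and assumes knitr is installed.

```r
library(knitr)

# A one-line document containing two pieces of inline R code
rmd_file <- tempfile(fileext = ".Rmd")
md_file  <- sub("\\.Rmd$", ".md", rmd_file)
writeLines(
  "There are `r ncol(mtcars)` variables and `r nrow(mtcars)` observations in the `mtcars` dataset.",
  rmd_file
)

# knitr replaces each inline expression with its value
knit(rmd_file, output = md_file, quiet = TRUE)
sentence <- readLines(md_file)
print(sentence)
```

If the underlying data changed, re-knitting would update both numbers in the sentence automatically.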

5.3 YAML header

YAML (rhymes with camel), originally short for “Yet Another Markup Language” and now officially “YAML Ain’t Markup Language”, is a standardized format for storing hierarchical data in a human-readable syntax. The YAML header controls how rmarkdown renders your .Rmd file. A YAML header is a section of key: value pairs surrounded by --- marks.

---
title: "My report"
author: "Kostis Christodoulou"
date: 2019-06-30
output: html_document
---

The most important option is output, as this determines the final document format. However, there are other common options, such as providing a title and author for your document and specifying the date of publication.
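
To check how a header will be read, you can parse it from R. The sketch below writes the header shown above to a temporary file and reads it back with rmarkdown::yaml_front_matter(); it assumes the rmarkdown package is installed, and the body text is a placeholder.

```r
library(rmarkdown)

# Write a small .Rmd file containing only a YAML header and one line of text
rmd_file <- tempfile(fileext = ".Rmd")
writeLines(c(
  "---",
  'title: "My report"',
  'author: "Kostis Christodoulou"',
  "date: 2019-06-30",
  "output: html_document",
  "---",
  "",
  "Body text."
), rmd_file)

# Parse the header into a named list of key: value pairs
header <- yaml_front_matter(rmd_file)
print(names(header))
print(header$title)
print(header$output)
```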

5.4 Output formats

5.4.1 HTML document

For your homework assignments, we have used github_document to generate a Markdown document. However, there are other document formats that are more commonly used.

output: html_document produces an HTML document. The nice feature of this document is that all images are embedded in the HTML file itself, so you can email just the .html file to someone and they will be able to open and read it.

5.4.2 Table of contents

Each output format has various options to customize the appearance of the final document. One option for HTML documents is to add a table of contents through the toc option. To add any option for an output format, just add it in a hierarchical format like this:

---
title: "My report"
author: "Kostis Christodoulou"
date: 2019-06-30
output:
  html_document:
    toc: true
    toc_depth: 2
---

You can explicitly set the number of levels included in the table of contents with toc_depth (the default is 3).

5.4.3 Appearance and style

There are several options that control the visual appearance of HTML documents.

  • theme specifies the Bootstrap theme to use for the page (themes are drawn from the Bootswatch theme library). Valid themes include "default", "cerulean", "journal", "flatly", "readable", "spacelab", "united", "cosmo", "lumen", "paper", "sandstone", "simplex", and "yeti".
  • highlight specifies the syntax highlighting style for code chunks. Supported styles include "default", "tango", "pygments", "kate", "monochrome", "espresso", "zenburn", "haddock", and "textmate".
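
Both options are set under html_document in the YAML header, in the same hierarchical way as toc. The header below is an illustrative sketch; the particular theme and highlight values are arbitrary picks from the lists above.

```yaml
---
title: "My report"
output:
  html_document:
    theme: cosmo
    highlight: tango
---
```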

This website uses the R Markdown Websites format to render multiple .Rmd documents in a single website. It uses the flatly theme and zenburn highlighting.

5.5 Multiple formats

You can render your document into multiple output formats (HTML, Word document, PDF, etc.) by supplying a list of formats:

output:
  html_document:
    toc: true
    toc_float: true
  pdf_document: default
  word_document:
    toc: true

If you don’t want to change any of the default options for a format, use the default option. You cannot specify multiple formats like this:

output:
  html_document:
    toc: true
    toc_float: true
  pdf_document

You must assign some value to the second output format, hence the use of default.

5.5.1 Rendering multiple outputs programmatically

When rendering multiple output formats, you cannot just click the “Knit” button. Doing so will only render the first output format listed in the YAML. To render all output formats, you need to programmatically render the document using rmarkdown::render("my-document.Rmd", output_format = "all"). Type ?render in the console to look up the help file for render() and see the different arguments the function can accept.
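
As a minimal sketch of rendering every listed format in one call, the example below writes a small document that declares two output formats and renders both. It assumes the rmarkdown package and pandoc are available (the render step is skipped otherwise); HTML and Word are used here because PDF output additionally needs a LaTeX installation. The document contents are placeholders.

```r
library(rmarkdown)

# A small document that declares two output formats in its YAML header
rmd_file <- tempfile(fileext = ".Rmd")
writeLines(c(
  "---",
  'title: "My report"',
  "output:",
  "  html_document: default",
  "  word_document: default",
  "---",
  "",
  "Some text."
), rmd_file)

# Render every format listed in the header; render() returns the output paths
outputs <- character(0)
if (pandoc_available()) {
  outputs <- render(rmd_file, output_format = "all", quiet = TRUE)
}
print(basename(outputs))
```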

5.6 R scripts

We don’t have to use R Markdown documents for all our work; in many instances, using a script might be preferable.

5.6.1 What is a script and when to use it?

A script is a plain-text file with a .R file extension. It contains R code, and it helps to add comments using the # symbol to explain to yourself (and others!) what you are doing. You edit scripts in the editor panel in RStudio.

Scripts are easier to troubleshoot than R Markdown documents because your code is not split across chunks and you can run everything interactively. When you first begin a project, you may find it easier to start with scripts to build and debug your code, and then convert your work to an R Markdown document once you begin the substantive analysis and writeup. Or you may use a mix of scripts and R Markdown documents depending on the size and complexity of your project. For instance, you could use a reproducible pipeline which uses a sequence of R scripts to download, import, and transform your data, then use an R Markdown document to produce a final report.
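
One possible way to organise such a pipeline is to number the scripts and run them in order. The sketch below is self-contained for illustration: it creates two tiny, hypothetical scripts (01_import.R and 02_transform.R) in a temporary folder and then sources them in sequence. In a real project the scripts would already exist in your project folder.

```r
# Create a temporary folder holding two hypothetical pipeline scripts
pipeline_dir <- tempfile("pipeline")
dir.create(pipeline_dir)
writeLines(
  "raw <- data.frame(x = 1:5, y = c(2, 4, 6, 8, 10))",
  file.path(pipeline_dir, "01_import.R")
)
writeLines(
  "clean <- transform(raw, ratio = y / x)",
  file.path(pipeline_dir, "02_transform.R")
)

# Run the scripts in name order, sharing one environment between the steps
steps <- sort(list.files(pipeline_dir, pattern = "\\.R$", full.names = TRUE))
pipeline_env <- new.env()
for (step in steps) {
  source(step, local = pipeline_env)
}

# The final object is now available for a report to use
print(pipeline_env$clean)
```

An R Markdown report could then load the saved results of the last step rather than repeating the earlier ones.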

5.7 R Markdown Tutorial

If you want to learn more, the people at RStudio have produced a brilliant R Markdown tutorial.