Welcome!

This introductory resource for Data Analytics with R has two goals: to get people started with R and RStudio, and to provide a brief sketch of some of the things R can be useful for in business and management. Throughout, we will use the tidyverse, an opinionated collection of R packages designed for data science. All of its packages share an underlying design philosophy, grammar, and data structures.

Data science is the discipline of turning data into understanding. We will learn how to do data science by following the workflow introduced by Garrett Grolemund and Hadley Wickham in their book R for Data Science.

Data science workflow. Source: R for Data Science by Garrett Grolemund and Hadley Wickham.
  1. Import
    • We need to get our data into whatever software package we use to analyse it. Most of us are familiar with data stored in Excel spreadsheets; however, a lot of interesting data does not come in a single, simple format. The information you need could be stored in a database or on a webpage, and you need to know how to extract it and bring it into R. For instance, if we wanted to study climate change, we could find data on the Combined Land-Surface Air and Sea-Surface Water Temperature Anomalies in the Northern Hemisphere at NASA’s Goddard Institute for Space Studies. How can we import this tabular data into R? (Hint: use read_csv() from the readr package.)
  2. Tidy, or reshape your data
    • Tidying your data means storing it in a form that lets you use the tidyverse to analyse it. When data is tidy, each column is a variable and each row is an observation. Looking back at the NASA Temperature Anomalies data, the variables we care about are month, year, and temperature anomaly. However, the data is not in tidy format: it comes in a generic tabular form in which a column does not hold a variable, but one specific value of the variable month (one column per month). By storing data in a tidy format, we can focus our efforts on questions about the data rather than constantly reshaping it into different forms. Contrary to what you might expect, the majority of an analyst’s time and effort is spent cleaning and wrangling data into a tidy format for analysis. While not glamorous, tidying is an important stage.
  3. Transform
    • Transforming data can involve several different tasks. Typically these include subsetting the data to focus on one specific set of observations, creating new variables that are functions of existing variables, and calculating summary statistics. In the temperature anomalies data, if we only had monthly data, how could we calculate monthly and yearly averages?
  4. Visualise
    • Humans love to visualise information, as it reduces the complexity of the data into an easily interpretable format. There are many different ways to visualise data. Knowing how to create specific graphs is important, but even more important is the ability to determine which visualisations are appropriate given the variables you wish to analyse. We will use the ggplot2 package, which has become the de facto standard for visualisation in R. Using the gapminder package, how do we visualise changes in GDP and life expectancy over time?
  5. Model
    • While visualisations are intuitive, they do not scale well to complex relationships. Visualising two or three variables may be straightforward, but once you are dealing with four or more variables, visualisations become pointless. Models are fundamentally mathematical, so they scale well to many variables. We will look at a variety of models in the core course.
  6. Communicate
    • You need to not only understand your data, but also communicate your results and tell a story to a larger audience, so that others can learn from your analysis and knowledge. A key element of communication is the ability to generate reproducible reports with R Markdown.
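
The first three steps (import, tidy, transform) can be sketched in a few lines of R. This is a minimal illustration under stated assumptions, not the course's own solution: the numbers are made up for the example and are not NASA's figures, the column names Year, month, and anomaly are chosen here for readability, and the code assumes readr 2.0 or later (for I() and show_col_types).

```r
library(readr)
library(tidyr)
library(dplyr)

# Illustrative values only (NOT real NASA data), laid out like the wide
# table described above: one row per year, one column per month.
raw_csv <- "Year,Jan,Feb,Mar
2018,1.0,1.2,1.1
2019,1.3,1.1,1.5
"

# 1. Import: I() marks the string as literal data rather than a file path;
#    in practice a file path or URL would go here instead.
wide <- read_csv(I(raw_csv), show_col_types = FALSE)

# 2. Tidy: gather the month columns so that each row is one
#    (year, month) observation and each column is one variable.
tidy <- wide %>%
  pivot_longer(cols = -Year, names_to = "month", values_to = "anomaly")

# 3. Transform: average the monthly anomalies within each year.
yearly <- tidy %>%
  group_by(Year) %>%
  summarise(mean_anomaly = mean(anomaly), .groups = "drop")

print(tidy)
print(yearly)
```

With a real file, the literal string would be replaced by the path to the downloaded CSV, and the cols argument of pivot_longer() may need adjusting so that only the month columns are gathered.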


This is still a work in progress, but I hope it is useful. Please drop me an email if you find an error or if you need any help.

–Kostis Christodoulou



This page last updated on: 2020-07-14
