The goal of this lab is to introduce you to R and RStudio, which you’ll be using throughout the course both to learn the statistical concepts discussed in the course and to analyze real data and come to informed conclusions. To clarify which is which: R is the name of the programming language itself, and RStudio is a convenient interface.

As the labs progress, you are encouraged to explore beyond what the labs dictate; a willingness to experiment will make you a much better data scientist. Before we get to that stage, however, you need to build some basic fluency in R.

Before we start, let’s get RStudio running.

There are two ways you can run R and RStudio:

Option 1 is to log in to the Oberlin RStudio server from a web browser. Option 2 is to install the software on your own computer.

Advantages of the server approach: you can get to your account, and your files, from anywhere, and you don’t have to install anything.

Disadvantages of the server approach: you have to upload files to the server if you create them on your computer and want to use them in RStudio, and download them if you want to do something on your computer with a file that you created in RStudio. There may also be occasional server outages or slowdowns when a lot of people are using it at once.

Unless you already have R and RStudio installed on your machine and are used to using them there, you should log in to the server.

Here’s how to do it:

* In your web browser, visit rstudio.oberlin.edu.
* If you had an account already, or you filled in the background survey, your username should be your Obie ID (the short form of your email, not including the @oberlin.edu), and, unless you have changed it, the initial password is the same as your username. Let me know if you don’t have an account yet.
* Before moving on, you should change your password (unless you have done so before). From the “Tools” menu in RStudio (not your browser), select “Shell…”. A window will open with a blinking cursor. Type passwd. You will be prompted to enter your old password, followed by a new password twice. As you type you will not see anything appear, nor will the cursor move, but the computer is registering your keystrokes. When done, you can close this window.

RStudio Basics

Today we begin with the fundamental building blocks of R and RStudio: the interface, reading in data, and basic commands.

[Figure: the RStudio interface]

The panel on the left is where the action happens. It’s called the console. Every time you launch RStudio, it will have the same text at the top of the console telling you the version of R that you’re running. Below that information is the prompt. As its name suggests, this prompt is really a request, a request for a command. Initially, interacting with R is all about typing commands and interpreting the output. These commands and their syntax have evolved over decades (literally) and now provide what many users feel is a fairly natural way to access data and organize, describe, and invoke statistical computations.

The panel in the upper right contains your workspace as well as a history of the commands that you’ve previously entered.

Any plots that you generate will show up in the panel in the lower right corner. This is also where you can browse your files, access help, manage packages, etc.

R Packages

R is an open-source programming language, meaning that users can contribute packages that make our lives easier, and we can use them for free. For this lab, and many others in the future, we will use the tidyverse collection of R packages.

These packages are already available on the RStudio server. If you are working on your own local installation of R and RStudio, you may need to install them by typing the following line of code into the console of your RStudio session and pressing the enter/return key. Note that you can check to see which packages (and which versions) are installed by inspecting the Packages tab in the lower right panel of RStudio.

install.packages("tidyverse")

You may need to select a server from which to download; any of them will work. You only need to run the above line once.
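If you are not sure whether the package is already installed, you can ask R before downloading anything. The lines below are a minimal sketch using base R's requireNamespace(); the variable name has_tidyverse is just an illustrative choice.

```r
# TRUE if the tidyverse package can be found in your library, FALSE otherwise.
has_tidyverse <- requireNamespace("tidyverse", quietly = TRUE)

if (has_tidyverse) {
  message("tidyverse is already installed; no need to install it again.")
} else {
  message("tidyverse not found; run install.packages(\"tidyverse\") once.")
}
```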

Next, you need to load these packages in your working environment. You will need to do this once for each new lab, project, or report that you create.

We do this with the library() function. Run the following line in your console.

library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag():    dplyr, stats

Loading Data from a File

To get you started, run the following command to load the data from a comma-separated value (CSV) file hosted online.

arbuthnot <- read.csv("http://www.openintro.org/stat/data/arbuthnot.csv")

This command instructs R to fetch the data from the given URL.

You can also point to a file on your computer by supplying the file path in place of the URL. For example, something like "/Users/cdawson/data/some_fun_data.csv".
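To see how reading from a file path works without needing a particular file on your machine, here is a small sketch that first writes a tiny CSV to a temporary location and then reads it back. The names csv_path and small_data are made up for this illustration, and the two rows are the first two years of the Arbuthnot data.

```r
# Write a tiny CSV file to a temporary location (a stand-in for a file you
# already have on your computer), then read it back with read.csv().
csv_path <- tempfile(fileext = ".csv")
writeLines(c("year,boys,girls",
             "1629,5218,4683",
             "1630,4858,4457"),
           csv_path)

small_data <- read.csv(csv_path)   # a file path goes where the URL went
small_data
```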

The data consists of the Arbuthnot baptism counts for boys and girls. You should see that the workspace area in the upper righthand corner of the RStudio window now lists a data set called arbuthnot that has 82 observations on 3 variables.

As you interact with R, you will create a series of objects. Sometimes you load them as we have done here, and sometimes you create them yourself as the byproduct of a computation or some analysis you have performed. Note that because you are accessing data from the web, this command (and the entire assignment) will work in a computer lab, in the library, or in your dorm room; anywhere you have access to the Internet.

The Arbuthnot data set refers to Dr. John Arbuthnot, an 18th century physician, writer, and mathematician. He was interested in the ratio of newborn boys to newborn girls, so he gathered the baptism records for children born in London for every year from 1629 to 1710. We can take a look at the data by typing its name into the console.

arbuthnot

However, printing the whole dataset in the console is not that useful. One advantage of RStudio is that it comes with a built-in data viewer. Click on the name arbuthnot in the Environment pane (upper right window) that lists the objects in your workspace. This will bring up an alternative display of the data set in the Data Viewer (upper left window). You can close the data viewer by clicking on the “X” in the upper lefthand corner.

What you should see are four columns of numbers, each row representing a different year: the first entry in each row is simply the row number (an index we can use to access the data from individual years if we want), the second is the year, and the third and fourth are the numbers of boys and girls baptized that year, respectively. Use the scrollbar on the right side of the console window to examine the complete data set.

Note that the row numbers in the first column are not part of Arbuthnot’s data. R adds them as part of its printout to help you make visual comparisons. You can think of them as the index that you see on the left side of a spreadsheet. In fact, the comparison to a spreadsheet will generally be helpful. R has stored Arbuthnot’s data in a kind of spreadsheet or table called a data frame.

You can see the dimensions of this data frame by typing:

glimpse(arbuthnot)

This command should output the following:

## Observations: 82
## Variables: 3
## $ year  <int> 1629, 1630, 1631, 1632, 1633, 1634, 1635, 1636, 1637, 16...
## $ boys  <int> 5218, 4858, 4422, 4994, 5158, 5035, 5106, 4917, 4703, 53...
## $ girls <int> 4683, 4457, 4102, 4590, 4839, 4820, 4928, 4605, 4457, 49...

We can see that there are 82 observations (aka “cases”) and 3 variables in this dataset. The variable names are year, boys, and girls. At this point, you might notice that many of the commands in R look a lot like functions from math class; that is, invoking R commands means supplying a function with some number of arguments. The glimpse() command, for example, took a single argument, the name of a data frame.
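If you only want the dimensions or the variable names, base R has single-purpose functions for each. The sketch below uses a three-row stand-in called arbuthnot_sample (a name invented here, built from the first values in the output above) so that it runs without an Internet connection; with the real data you would pass arbuthnot instead.

```r
# A three-row stand-in for arbuthnot, using the first values shown above.
arbuthnot_sample <- data.frame(
  year  = c(1629, 1630, 1631),
  boys  = c(5218, 4858, 4422),
  girls = c(4683, 4457, 4102)
)

dim(arbuthnot_sample)    # number of rows, then number of columns
nrow(arbuthnot_sample)   # number of observations
names(arbuthnot_sample)  # variable names
```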

Some Exploration

Let’s start to examine the data a little more closely. We can access the data in a single column of a data frame separately using a command like

pull(arbuthnot, boys)

This command will only show the number of boys baptized each year. In a sense, pull() extracts the variable boys from the data frame arbuthnot.

  1. What command would you use to extract just the counts of girls baptized? Try it!

Notice that the way R has printed these data is different. When we looked at the complete data frame, we saw 82 rows, one on each line of the display. These data are no longer structured in a table with other variables, so they are displayed one right after another. Objects that print out in this way are called vectors; they represent a set of numbers. R has added numbers in [brackets] along the left side of the printout to indicate locations within the vector. For example, 5218 follows [1], indicating that 5218 is the first entry in the vector. And if [43] starts a line, then that would mean the first number on that line would represent the 43rd entry in the vector.
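The bracketed numbers are more than decoration: you can use the same bracket notation to ask for a particular entry yourself. Here is a short sketch with a hand-typed vector holding the first five boys' counts (the name boys_sample is only for illustration).

```r
# First five boys' counts from the printout (1629 to 1633).
boys_sample <- c(5218, 4858, 4422, 4994, 5158)

boys_sample[1]        # first entry: 5218
boys_sample[3]        # third entry: 4422
length(boys_sample)   # number of entries in this short vector: 5
```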

Data visualization

R has some powerful functions for making graphics. We can create a simple plot of the number of girls baptized per year with the command

qplot(x = year, y = girls, data = arbuthnot)

The qplot() function (meaning “quick plot”) considers the type of data you have provided it and makes the decision to visualize it with a scatterplot. The plot should appear under the Plots tab of the lower right panel of RStudio. Notice that the command above again looks like a function, this time with three arguments separated by commas. The first two arguments in the qplot() function specify the variables for the x-axis and the y-axis and the third provides the name of the data set where they can be found. If we wanted to connect the data points with lines, we could add a fourth argument to specify the geometry that we’d like.

qplot(x = year, y = girls, data = arbuthnot, geom = "line")
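As an aside, qplot() is a shortcut for the fuller ggplot() interface in the ggplot2 package, and recent ggplot2 releases (3.4.0 and later) mark qplot() as deprecated, so you may see a warning when you use it. Below is a sketch of what the same line plot might look like with ggplot(), assuming ggplot2 is installed; arbuthnot_sample is a three-row stand-in invented for this example.

```r
library(ggplot2)

# Three-row stand-in for the arbuthnot data frame.
arbuthnot_sample <- data.frame(
  year  = c(1629, 1630, 1631),
  girls = c(4683, 4457, 4102)
)

# aes() maps variables to the axes; geom_line() plays the role of geom = "line".
p <- ggplot(arbuthnot_sample, aes(x = year, y = girls)) +
  geom_line()
p
```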

You might wonder how you are supposed to know that it was possible to add that fourth argument. Thankfully, R documents all of its functions extensively. To read what a function does and learn the arguments that are available to you, just type in a question mark followed by the name of the function that you’re interested in. Try the following.

?qplot

Notice that the help file replaces the plot in the lower right panel. You can toggle between plots and help files using the tabs at the top of that panel.
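Typing ?qplot is shorthand for help("qplot"). If you just want a quick reminder of a function's arguments without opening the help page, base R's args() and formals() can list them. The sketch below uses read.csv() because it is available without loading any packages; csv_args is a name chosen for illustration.

```r
# Print the function's signature: argument names and their default values.
args(read.csv)

# Just the argument names, as a character vector.
csv_args <- names(formals(read.csv))
csv_args
```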

  1. Is there an apparent trend in the number of girls baptized over the years? How would you describe it?

Functions, arguments and commands

Most of what we do in R consists of applying functions to data objects, specifying some options for the function, which are called arguments. The application of a function, together with its arguments, is called a command.

A useful analogy is that commands are like sentences, where the function is the verb, and the arguments (one of which usually specifies the data object) are the nouns. There is often a special argument that comes first. This is like the direct object of the command.

For example, in the English command, “Draw a picture for me with some paint”, the verb “draw” acts like the function (what is the listener supposed to do?); the noun “picture” is the direct object (draw what?), and “me” and “paint” are extra (in this case, optional) details, that we might call the “recipient” and the “instrument”.

In the grammar of R, I could write this sentence like:

## Note: this is not real R code
draw("picture", recipient = "me", material = "paint")

We are applying the function draw() to the object "picture", and adding some additional detail about the recipient and material. Here the function is called draw, and we have a main argument with the value "picture", and additional arguments recipient and material with the values "me", and "paint", respectively.

Technically speaking, "picture" is the value of an argument too; we might have written

### Note: this is not real R code
draw(object = "picture", recipient = "me", material = "paint")

However, in practice, there is often a required first “main” argument whose name is left out of the command.

In R, arguments always go inside parentheses, and are separated by commas when there is more than one. For arguments whose names are explicitly given, the name goes to the left of the =, and the value goes to the right.
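Here is the same idea with a real R function. round() takes a main argument (the number to round) and an optional argument called digits; the sketch below calls it once with the main argument unnamed and once with every argument named. The variable names are only for illustration.

```r
# Main argument given by position, extra argument given by name.
rounded_positional <- round(3.14159, digits = 2)

# Every argument named explicitly; the result is the same.
rounded_named <- round(x = 3.14159, digits = 2)

rounded_positional
identical(rounded_positional, rounded_named)
```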

Creating new variables

Now, suppose we want to plot the total number of baptisms. To compute this, we could use the fact that R is really just a big calculator. We can type in mathematical expressions like

5218 + 4683

to see the total number of baptisms in 1629. We could repeat this once for each year, but there is a faster way. If we add the vector for baptisms for boys and girls, R will compute all sums simultaneously.

pull(arbuthnot, boys) + pull(arbuthnot, girls)

What you will see are 82 numbers (in that packed display, because we aren’t looking at a data frame here), each one representing the sum we’re after. Take a look at a few of them and verify that they are right.
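To check the idea by hand, here is a sketch using only the first three years, typed in as short vectors (the names boys_sample, girls_sample, and totals_sample are made up for this example). R adds the vectors entry by entry.

```r
boys_sample  <- c(5218, 4858, 4422)   # 1629, 1630, 1631
girls_sample <- c(4683, 4457, 4102)

# Entry-by-entry addition: first with first, second with second, and so on.
totals_sample <- boys_sample + girls_sample
totals_sample   # 9901 9315 8524
```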

To keep our data organized, and to facilitate plotting, we want to save this variable as a column in our data frame. We can do this as follows:

arbuthnot <- arbuthnot %>% mutate(total = boys + girls)

The %>% operator is called the “piping” operator. It takes the output of the previous expression and pipes it into the first argument of the function in the following one. To continue our analogy with mathematical functions, x %>% f(y) is equivalent to f(x, y).
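You can convince yourself of this equivalence with a small experiment. The sketch below assumes the dplyr package is installed (it is one of the packages that library(tidyverse) loads, and it makes %>% available); the variable names are illustrative.

```r
library(dplyr)   # makes the %>% operator available

x <- c(1.234, 5.678)

piped  <- x %>% round(1)   # x is piped in as the first argument of round()
nested <- round(x, 1)      # the ordinary way of writing the same call

identical(piped, nested)   # TRUE
```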

We can read this command as the following:

“Take the arbuthnot dataset and pipe it into the mutate() function. Mutate the arbuthnot data set by creating a new variable called total that is the sum of the variables called boys and girls. Then assign the resulting dataset to the object called arbuthnot, i.e. overwrite the old arbuthnot dataset with the new one containing the new variable.”

When you make changes to variables in your dataset, click on the name of the dataset again to update it in the data viewer.

The special symbol <- performs an assignment, taking the output of one line of code and saving it into an object in your workspace. In this case, you already have an object called arbuthnot, so this command updates that data set with the new mutated column.
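It is worth seeing why the assignment matters: mutate() returns a new data frame and leaves the original alone until you save the result. The sketch below demonstrates this on a three-row stand-in, assuming dplyr is installed; arbuthnot_sample and with_total are names invented for the example.

```r
library(dplyr)

arbuthnot_sample <- data.frame(
  year  = c(1629, 1630, 1631),
  boys  = c(5218, 4858, 4422),
  girls = c(4683, 4457, 4102)
)

# mutate() builds a new data frame; the original is not changed yet.
with_total <- arbuthnot_sample %>% mutate(total = boys + girls)
names(arbuthnot_sample)   # still year, boys, girls

# Assigning the result back to the same name overwrites the old version.
arbuthnot_sample <- with_total
names(arbuthnot_sample)   # now includes total
```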

We can make a plot of the total number of baptisms per year with the command

qplot(x = year, y = total, data = arbuthnot, geom = "line")

  1. Combine the mutate() idea and the qplot() syntax to create a plot of the proportion of boys born over time. What do you see?

Getting credit

To get credit for this lab, respond to the following prompt by Direct Message to me on Slack by Monday at 2:30 P.M.

This is a product of OpenIntro that is released under a Creative Commons Attribution-ShareAlike 3.0 Unported license. This lab was originally adapted for OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel from a lab written by Mark Hansen of UCLA Statistics, and further adapted by Colin Dawson.