STAT 113 Lab 2: RMarkdown and Descriptive Statistics

Partnership Roles

One of you (the person in column “A” in the table) will be the “reader”. Your job is to have this document open, read it aloud.
The other (in column “B”) is the “typist”. You are the one who, for this assignment, does the typing. The typist can ask the reader to stop, and can ask clarification questions, but should type only what the reader tells them to type.
In places where you need to write your own code, discuss it together, but then the reader should make the final call as to what gets typed.

This should help you slow down, read carefully, pay attention to detail, and make sure you understand what you’re reading and writing.

Log in to the Oberlin RStudioPro Server and Create a Project

In your web browser, visit rstudiopro.oberlin.edu.
Your username is your Obie ID. You should have changed your password last week, but if not it is also your Obie ID. See Lab 1 for instructions on changing it.
Click the Project dropdown menu in the top right of the RStudio window and select New Project. Call the project lab2 and make a directory for it inside your stat113 folder.

RMarkdown

Last week we stored our commands in a file called a script, which was just a text file with a command (or a comment) on each line. This week we will use a format called RMarkdown that helps us keep our work better organized. It is basically like a Word document with some simple plain text syntax to control formatting, with mini R scripts interspersed, each followed by whatever output it produces (it is also what I use to make these lab documents).

Open the file RMarkdown-template.Rmd in your stat113 folder.
Save it as lab2.Rmd in your project directory.
Read through the .Rmd file to get a basic sense of how RMarkdown works.
As you work through this lab, add your code and text to the .Rmd file as instructed in the template. This will help you keep things organized.

Don’t move on until you have read through the .Rmd template and understand where and how to add your work to it!

As you work through the commands listed in this packet, you should type the 
commands in your `.Rmd` file. Make a new code chunk for each chunk of code 
as it appears here, and run that chunk using the "Run Chunk" button at the 
upper right of the shaded region.

Any text that you write that isn't code can go in your .Rmd file *not* in 
a code chunk. Unlike in an R Script, you do not need to precede verbal 
commentary with the # symbol, as long as it is not within a chunk.

Warmup

Examine the documentation for the EmployedACS dataset from the Lock5Data package, including the description of the values of each variable. Which ones are quantitative? Which are categorical? Which are binary?

Describing and Visualizing Data

In this section we will look at how to create some basic visualizations and get some basic summary statistics in R.

Graphing a Quantitative Variable

We are going to be using functions from the following packages, so if you haven’t already, load them at the start of your script (pay attention to capitalization):

mosaic
ggformula
Lock5Data

library(mosaic)    # Needed for the various functions
library(ggformula) # Needed for some plotting functions
library(Lock5Data) # Needed for the EmployedACS data

Now, load the EmployedACS dataset from the Lock5Data package. Since the package is already loaded, we can go straight to the data() line.

data(EmployedACS)  # Loading the dataset

R Tip:

As a rule, any time you start a new RMarkdown file, you should load any
libraries you need in a chunk near the top of the file. The template
I provided loads mosaic and ggformula for you already. Don't forget
to load any datasets you need, using the method appropriate to the 
source of the data, (data() for datasets from packages, 
read.file() for data from files).

Dot Plots and Histograms

For a quantitative variable like HoursWk (Hours worked per week), we can get a general sense of what values are in the data using a dot plot or a histogram.

## The gf_ prefix indicates that we are using a plot function 
## from the ggformula package, which gives us a consistent template across
## different plot types.

## Unfortunately with this particular plotting function, the y axis does not
## show the counts.
gf_dotplot(~HoursWk, data = EmployedACS)

You should see the plot appear in the lower-right “Plots” pane. Dots represent cases, and they are stacked on an axis representing hours worked. It is the same idea as a histogram, but depicting individual data points.

Let’s examine the structure of the command you just typed. The first argument here (with the $\sim$ in front) is a formula, with the name of the variable we are interested in plotting appearing on the right hand side of the tilde ($\sim$) symbol. We will see this format appear over and over again.

We will also see the data= argument appear over and over again. This indicates what data frame the variable we are interested in is in.

Likely the stacks of dots are rounded off at weird intervals, and the dots are an odd size. We can modify the gf_dotplot() command to control the rounding that gets done and the size of the dots with the binwidth= and dotsize= arguments, respectively.

gf_dotplot(~HoursWk, data = EmployedACS, binwidth = 1, dotsize = 0.25)

This says to group the dots in bins spanning a range of 5 hours, and to magnify the default dot size by a factor of 3. Alternatively, we could use nint= instead of width= to control the number of bins instead of the width of each bin.

Alternatively, we could create a histogram:

gf_histogram(~HoursWk, data = EmployedACS, type = "count", binwidth = 5)

You should again see a horizontal axis representing number of hours worked, but now instead of individual dots, there are bars. The height of each bar represents the number of cases in a particular range.

Note that the command has similar structure as dotPlot(): we have a “formula” giving the variable name we are plotting, a data= argument specifying the dataset that variable is in. Some of the optional arguments have different names, and we also have a type= argument to control whether we’re plotting counts or density on the $y$-axis.

Optional: Controlling Color

You can make your plot (possibly) prettier by adding a bit of color. In this case, we use the fill= argument to control the color that fills in the bars.

gf_histogram(~HoursWk, data =  EmployedACS, binwidth = 5, fill = "green")

You can type colors() at the console to see a (long) list of available colors.

Alternatively you can use a predefined color palette and then refer to colors by number. For example, try this:

library("RColorBrewer")  # a package with some decent palettes
display.brewer.all()     # Shows names of some nice color palettes

## Sets the palette to the first 8 colors in "Set1"
## Type ?brewer.pal to see other palette options
palette(brewer.pal(n = 8, name = "Set1"))

Note:

When you run the above chunk, the palette image will be squashed vertically. 
To be able to read the palette names, you can "pop out" the image with the 
"Show in New Window" button at the upper right of the image. To change 
the dimensions of the image in your Knitted document, you can modify the 
chunk options with the "Gear" button. Choose "Use custom figure size", 
and experiment with measurements until you get a size that looks good.

Now if we give colors by number it will use the palette (note, however, that if we pick a value for n which is less than the number of colors in the palette, it may not choose the first n colors; instead, it will choose n colors that are roughly equally spaced across the palette range).

gf_histogram(~HoursWk, data = EmployedACS, binwidth = 5, fill = 3)

Before moving on, create a histogram of the Income variable in this dataset. Describe the shape of the distribution and estimate the mean and median income of the individuals in this data by eyeballing the plots (you will likely want to examine the documentation for the dataset to see how income is measured).

We can also create a box-and-whisker plot for a quantitative variable, which depicts the median, the first quartile (the value that has 25% of the cases below it, in other words the 25th percentile, or the “median of the lower half of the cases”), and the third quartile (the value with 3/4 or 75% of the data below it, which is the median of the upper half of cases).

The “whiskers” extend out from the box, to a width at most 1.5 times the width of the box. Points beyond the whiskers are depicted individually, identified as potential outliers

We’ll talk about all this more formally in class tomorrow.

## By default gf_boxplot() makes vertical boxplots. But to be consistent with
## the orientation of our histograms, we'd prefer horizontal. The following
## code modifies the plot by flipping the coordinates.
## The %>% is called a "pipe": it sends the output of one command into
## the input of the next command.
gf_boxplot(~Income, data = EmployedACS) %>% 
  gf_refine(coord_flip())

Note that what is now the $y$ axis here has meaningless tick marks. Set that aside for now.

Estimate the first and third quartile of the Income variable from the box-and-whisker plot.

Summarizing a Quantitative Variable

We can get a set of useful summary statistics for our HoursWk variable using the favstats() function. The syntax is the same as for gf_dotplot() and gf_histogram(), in that we have a formula and a data= argument.

favstats(~HoursWk, data = EmployedACS)

Find the mean, median, Q1 and Q3 for the Income variable. Were the estimates you made above by eyeballing the plots reasonably close?

We can get individual summary statistics using the same format. For example,

mean(~HoursWk, data = EmployedACS)
median(~HoursWk, data = EmployedACS)
sd(~HoursWk, data = EmployedACS)      # Standard Deviation
var(~HoursWk, data = EmployedACS)     # Variance

If you don’t know what standard deviation and variance are, don’t worry; we will talk about them tomorrow too.

Summarizing a Categorical Variable

Frequency and Relative Frequency Tables

Mean, median, etc. do not make sense for a categorical variable, since we cannot add the values together, and they are not ordered.

Instead, we can summarize the variable with a frequency table:

tally(~Race, data = EmployedACS)

Or, we can get a relative frequency table by specifying format = "proportion"

tally(~Race, data = EmployedACS, format = "proportion")

Using the relative frequency table, find the total percentage of the sample that is non-white? You can use R as a calculator (for example, try typing 2 + 2).

Contingency Tables and Conditional Proportions

We can get a contingency table showing combinations of levels of two categorical variables (for example, Sex and Race) as follows

tally(Race ~ Sex, data = EmployedACS, format = "count")

If we want to condition on (group by) one of the variables, and see what proportion of each group is in each category of the other variable, we can do this as follows:

## The vertical bar is the symbol for "grouped by" or "conditioned on".
## The variable to the right is the one defining the groups.
tally(~Sex | Race, data = EmployedACS, format = "proportion")

What contrasts jump out at you, looking at this table?
Construct a table of conditional proportions to help you answer the following question: How does the proportion of workers without health insurance differ between U.S. citizens and non-U.S. citizens in this data?

Visualizing a Categorical Variable

Bar Graphs

We can plot counts or proportions using a bar graph (and definitely not a pie chart…)

## Counts version
gf_bar(~Race, data = EmployedACS)
## Proportions version
gf_props(~Race, data = EmployedACS)
## Percentages version
gf_percents(~Race, data = EmployedACS)

Or, with fill colors determined by the Race variable:

gf_percents(~Race, data = EmployedACS, fill = ~Race)

(Note that to have fill color depend on a variable, we use formula syntax here, rather than providing a single fixed color name in quotes)

Create a bargraph showing the proportions of cases that do and do not have health insurance. Is there anything unsatisfying about the graph if you were going to put it in an article or paper?

Modify your graph by adding a label for the $x$-axis by adding an xlab= argument to the plot function (you might need to refer back to Lab 1 to remind yourself how arguments work). Put the desired label in quotes as the argument value so the reader knows what the bars represent.

Visualizing Two-Way Contingency Tables

If we have a contingency table with two categorical variables, we can visualize it in a couple of ways. One is to use a grouped bar graph:

## The vertical bar means "conditioned on" or "grouped by".
## Unfortunately since Sex is encoded as 0 or 1, this is what gets displayed.
## This can be fixed, but it's not super simple, so we'll skip it for now
gf_percents(~Race | Sex, data = EmployedACS, fill = ~Race)

Quantitative Variables, Split by Group

As we’ve seen, we often want to know about proportions of the categories from one variable conditioned on (grouped by) the categories of another. We might similarly be interested in the distribution of a quantitative variable in each of several subsets of the data, where the subsets are defined by another categorical variable.

Plotting data by group

For example, we might be interested in the distribution of the number of hours worked, separating the data into married and not married individuals. We might want to look at this using boxplots. It is possible to use the same syntax we used above:

gf_boxplot(~HoursWk | Married, data = EmployedACS) %>% 
  gf_refine(coord_flip())

but it’s hard to compare the two plots side by side like this. The following syntax produces a similar plot but grouped in a more convenient way.

## We use factor() here to tell R that Married is categorical. Since it (like Sex) is 
## coded as 0 or 1 in the data, R does not know this automatically.
gf_boxplot(HoursWk ~ factor(Married), data = EmployedACS) %>% 
  gf_refine(coord_flip())

What differences jump out between the distribution of hours worked for married vs unmarried individuals? Speculate about what might account for this. Why do you suppose the dark line depicting the median is right at the edge of the box with this data? (Hint: look back at the histogram of the HoursWk variable you created above)

Summary statistics by group

We can use the same command structure to find means, medians, etc. split by group.

favstats(~HoursWk | Married, data = EmployedACS)
mean(~HoursWk | Married, data = EmployedACS)

Compute the mean income for workers with and without health insurance.
Produce side-by-side boxplots with a single shared axis showing the distribution of income, grouped by whether the person has health insurance (you will want to use factor() to tell R to treat the grouping variable as categorical).
Suppose a new person (not in this dataset) filled out the survey and reported income of $80K but neglected to indicate whether they have health insurance. Based on the side-by-side boxplots, would you guess that they have health insurance or not? How confident would you be? Explain.

Turning in your Work

When you’re finished, Knit your finished .Rmd file. Once you have corrected any errors you are getting and get your file to Knit successfully, save both the source file (as lab2.Rmd) and the Knitted .html output file (lab2.html) that gets automatically created to the ~/stat113/turnin/lab2/ folder. Only one file per partnership is needed if you completed the whole lab together. If you run out of time in class and are unable to meet to finish together, the typist can send a copy of this file to their partner, and then both can continue and each submit a file which is partly identical.

If you forget how to download a file from the server so that you can send it to your partner, see Lab 1 (section “Files, Plots, Packages, Help Viewer”) for a reminder.