STAT 113 Lab 4: Describing Categorical and Quantitative Variables

Partnership Roles

Whichever group member has the name which is second alphabetically is the default “screen-sharer”
Whichever person has the ObieID which is third alphabetically (or first, if it’s a group of two) will be the reader for this lab. You should read the document aloud as you go. I realize this may feel a bit awkward, but it will help you slow down, not miss important information, and stay together. Don’t just skip to the Exercises! The examples in between are often very similar to what the exercises are asking you to do; if you skip ahead your are less likely to know what to do!

Log in to the Oberlin RStudioPro Server and Create a Project

To create a project, click the Project dropdown menu in the top right of the RStudio window and select New Project. Call the project lab4 and make a directory for it wherever you have been putting project folders.
With this file open, do File > Save As… and save a copy of lab4.Rmd in your project folder. Make your edits there.

Using RMarkdown

As you work through the lab, you should run the example chunks as you go. Use the “Run Chunk” button at the top right of the chunk.

For numbered exercises, put your text and code between the — markers and below the ##### lines. Leave a blank line before and after the ``` lines starting and ending code chunk as well as before and after the — markers. Like this:

YOUR ANSWER HEADER

Here’s what I think about this!

code in here

Here’s another thing I think about that!

To create a new code chunk you can either copy and paste an existing chunk, type the delimiter lines directly, or use the “Insert Chunk > R” button at the top of the editor pane (the green square with a white C). There is also a keyboard shortcut to insert a new chunk.

Warmup

Examine the code book for the EmployedACS dataset from the Lock5Data package. You will first need to load the package with library(PackageName) ,then the dataset, with data(DatasetName), then finally, pull up the code book with help(DatasetName). The library() and data() commands should be in a code chunk, but the help() command is best typed at the console so it doesn’t run when you Knit. You can refer back to Lab3 for reference if needed. Which variables are quantitative? Which are categorical?

SOLUTION to Exercise 1

Describing and Visualizing Data

In this section we will look at how to create some basic visualizations and get some basic summary statistics in R.

Graphing a Quantitative Variable

We are going to be using functions from the following packages. Make sure they are loaded via the library() command. You only need to load each package once per document (but it doesn’t hurt anything to do it again).

mosaic
tidyverse
Lock5Data

## mosaic and should be loaded in the setup chunk above already. Make sure you run it though!
## Use the "Run All Chunks Above" button at the top right of this chunk
library(Lock5Data)

Having loaded the package, we can load the dataset with data() (you should have done this in Exercise 1, but here’s the code just in case)

data(EmployedACS)  # Loading the dataset

Histograms

For a quantitative variable like HoursWk (Hours worked per week), we can get a general sense of what values are in the data using a histogram. Below is a command that produces a histogram. You should see the plot appear below the chunk.

## Pay attention to capitalization. Pretty much everything in R is case-sensitive.
gf_histogram(~HoursWk, data = EmployedACS)

The “formula” and “data” pattern

Notice how the command above is structured: the function name is gf_histogram(). Many plotting functions will start with gf_, which is shorthand for the name of a convenient plotting package called ggformula. The function name describes what we are getting.

Then, the first argument has the name of the variable we are graphing, preceded by a tilde (~).

Finally, we have the named data= argument, which is set to the name of the dataset. We will see this same pattern over and over for different objectives (plots, summary statistics, etc.).

Here’s a general template for many of the commands we will use (this is not actually runnable code):

goal(~VariableName, data = DataSetName)

Modifying the bins

We can modify the gf_histogram() command to control how many “bins” we get, using the optional bins= argument (set it to the desired number of bins). Alternatively we can set binwidth= to a desired width.

gf_histogram(~HoursWk, data = EmployedACS, bins = 40)

gf_histogram(~HoursWk, data = EmployedACS, binwidth = 1)

Optional: Controlling Color

You can make your plot (possibly) prettier by adding a bit of color. In this case, we use the fill= argument to control the color that fills in the bars.

gf_histogram(~HoursWk, data =  EmployedACS, bins = 40, fill = "green")

You can type colors() at the console to see a (long) list of available color names.

Before moving on, create a histogram of the Income variable in this same dataset. Describe the shape of the distribution and estimate the mean and median income of the individuals in this data by eyeballing the plots (you will likely want to examine the documentation page for the dataset, using help() or the ? syntax to see how Income is measured).

SOLUTION to Exercise 2

Box and Whisker Plots

We can also create a box-and-whisker plot for a quantitative variable.

gf_boxplot(~Income, data = EmployedACS)

Notice that the pattern of this command is the same as the previous ones!

Visually estimate the first and third quartile of the Income variable from the box-and-whisker plot.

SOLUTION to Exercise 3

Summarizing a Quantitative Variable

We can get a set of useful summary statistics for the HoursWk variable using the favstats() function. The syntax is the same as for gf_histogram(), and gf_boxplot(), in that we have a formula (containing a variable name preceded by ~) and a data= argument (containing the name of the dataset).

favstats(~HoursWk, data = EmployedACS)

##  min Q1 median Q3 max     mean       sd    n missing
##    2 35     40 42  88 37.86713 12.94576 1287       0

Find the actual mean, median, Q1 and Q3 for the Income variable. Were the estimates you made above by eyeballing the plots reasonably close?

SOLUTION to Exercise 4

We can get individual summary statistics using the same format. For example,

mean(~HoursWk, data = EmployedACS)

## [1] 37.86713

median(~HoursWk, data = EmployedACS)

## [1] 40

sd(~HoursWk, data = EmployedACS)      # Standard Deviation

## [1] 12.94576

IQR(~HoursWk, data = EmployedACS)

## [1] 7

Summarizing a Categorical Variable

Frequency and Relative Frequency Tables

We can summarize a categorical variable such as Race using a frequency table. The tally() function will do the counting for us and create the table.

tally(~Race, data = EmployedACS)

## Race
## asian black other white 
##    92   116   102   977

We can get a relative frequency table instead by specifying format = "proportion" for proportions on a 0 to 1 scale, or or format = "percent" for percentages on a 0 to 100 scale.

tally(~Race, data = EmployedACS, format = "percent")

## Race
##     asian     black     other     white 
##  7.148407  9.013209  7.925408 75.912976

Using the relative frequency table, find the total percentage of the cases that are listed as asian, black, or other by using R as a calculator. Don’t worry about extracting the values directly from the table for now; for the time being just type in the values by looking at the table. Assign the total percentage to a variable called pctPoC.

SOLUTION to Exercise 5

Contingency Tables and Conditional Proportions

We can get a contingency table showing combinations of levels of two categorical variables (for example, HealthInsurance and Race) as follows:

tally(Race ~ HealthInsurance, data = EmployedACS, format = "count")

##        HealthInsurance
## Race      0   1
##   asian   6  86
##   black  10 106
##   other  22  80
##   white  70 907

The HealthInsurance variable is coded as 0 or 1. Refer to the code book to interpret these.

With two variables, we will often see formulas that look like this, with one variable name to the left of the tilde (~), which will generally be the “y” variable, and the other one to the right, which is like the “x” variable.

If we want to condition on (group by) one of the variables, and see what proportion of each group is in each category of the other variable — for example, to look at what percentage of respondents of each race are male and what percentage of respondents of each race are female, we can do this as follows:

## The vertical bar is the symbol for "grouped by" or "conditioned on".
## The variable to the right is the one defining the groups.
tally(~HealthInsurance | Race, data = EmployedACS, format = "percent")

##                Race
## HealthInsurance     asian     black     other     white
##               0  6.521739  8.620690 21.568627  7.164790
##               1 93.478261 91.379310 78.431373 92.835210

Notice that the percentages add up to 100 within each level of the grouping variable, but not within levels of the other variable.

What jumps out at you here?

SOLUTION to Exercise 6

Construct a table of conditional proportions/percentages to help you answer the following question: “How does the percentage of U.S. citizens who lack health insurance differ from the proportion of non-U.S. citizens who lack health insurance in this data?” Refer to the documentation to see what the variables are called and how they are represented.

SOLUTION to Exercise 7

Visualizing a Categorical Variable

Bar Graphs

We can plot counts or proportions using a bar graph (and definitely not a pie chart…)

## Counts version
gf_bar(~Race, data = EmployedACS)

## Proportions version
gf_props(~Race, data = EmployedACS)

## Percentages version
gf_percents(~Race, data = EmployedACS)

Or, with fill colors determined by the Race variable (note that having the color vary by the value of a variable uses the formula syntax too):

gf_percents(~Race, data = EmployedACS, fill = ~Race)

Create a bar graph showing the proportions of cases that do and do not have health insurance. Is there anything unsatisfying about the graph if you were going to put it in an article or paper?

Solution to Exercise 8

We can modify the graph in two ways:

First, we can tell R that HealthInsurance is categorical using the factor() function.
Next, we can add a label for the $x$-axis by adding an xlab= argument to the gf_bar() function. Put the desired label in quotes as the value of the xlab= argument so the reader knows what the bars represent:

gf_props(
  ~factor(HealthInsurance), 
  data = EmployedACS, 
  xlab = "Health Insurance Status (1 = has insurance; 0 = does not have insurance)")

Visualizing Two-Way Contingency Tables

If we have a contingency table with two categorical variables, we can visualize it in a couple of ways. One is to use a grouped bar graph:

## The vertical bar means "conditioned on" or "grouped by".
## Unfortunately since HealthInsurance is encoded as 0 or 1, this is what gets displayed.
## This can be fixed, but it's not super simple, so we'll skip it for now
gf_percents(
  ~factor(HealthInsurance) | Race, 
  data = EmployedACS, 
  fill = ~factor(HealthInsurance))

It might be easier to see what’s going on in this graph if we use stacked bars. Getting stacks with the conditional percentages we want is a bit unintuitive, but here’s how to do it. While we’re at it I’ll fix the legend. Don’t feel you need to know how to do the tweaks below, but in case you want to, it’s here for reference.

gf_percents(
  ~Race,                            ## defines the 'x' variable 
  fill  = ~factor(HealthInsurance), ## defines the variable for color
  denom = ~x,                       ## sets the x variable totals to be the denominators          
  position = "stack",
  data  = EmployedACS) +            ## The + allows us to add another command to tweak the key
scale_fill_discrete(
  name   = "Health Insurance Status",
  labels = c(`0` = "Uninsured", `1` = "Insured")
)

Quantitative Variables, Split by Group

As we’ve seen, we often want to know about proportions of the categories from one variable conditioned on (grouped by) the categories of another. We might similarly be interested in the distribution of a quantitative variable in each of several subsets of the data, where the subsets are defined by another categorical variable.

Plotting data by group

For example, we might be interested in the distribution of the number of hours worked, separating the data into married and not married individuals. We might want to look at this using boxplots. It is possible to use the same syntax we used above:

gf_boxplot(~HoursWk | Married, data = EmployedACS)

but it’s hard to compare the two plots side by side like this. The following syntax produces a similar plot but grouped in a more convenient way.

## factor() here tells R that Married is categorical. Since it (like Sex) is 
## coded as 0 or 1 in the data, R does not know this automatically.
gf_boxplot(HoursWk ~ factor(Married), data = EmployedACS)

By default, the variable before the ~ goes on the $y$-axis and the variable after the tilde goes on the $x$ axis. If we wanted horizontal boxplots like we’re more used to, we could swap their positions.

gf_boxplot(factor(Married) ~ HoursWk, data = EmployedACS)

What differences jump out between the distribution of hours worked for married vs unmarried individuals? Speculate about what might account for this.

SOLUTION to Exercise 9

Why do you suppose the dark line depicting the median is right at the edge of the box with this data? (Hint: look back at the histogram of the HoursWk variable you created above)

SOLUTION to Exercise 10

Summary statistics by group

We can use the analogous command structure to find means, medians, etc., split by group.

favstats(~HoursWk | Married, data = EmployedACS)

##   Married min Q1 median Q3 max     mean       sd   n missing
## 1       0   2 25     40 40  88 35.35972 14.04229 581       0
## 2       1   3 40     40 45  80 39.93059 11.57753 706       0

mean(~HoursWk | Married, data = EmployedACS)

##        0        1 
## 35.35972 39.93059

HOMEWORK

Compute the mean income for workers, grouped by whether they have health insurance.

SOLUTIONS to Exercise 11

Produce side-by-side boxplots with a single shared axis showing the distribution of income, grouped by whether the person has health insurance (you will want to use factor() to tell R to treat the grouping variable as categorical).

SOLUTIONS to Exercise 12

Suppose a new person (not in this dataset) filled out the survey and reported income of $80K but neglected to indicate whether they have health insurance. Based on the side-by-side boxplots, would you guess that they have health insurance or not? How confident would you be? Explain.

SOLUTIONS to Exercise 13

When you’re finished, Knit your finished .Rmd file to .html. Once you have corrected any errors you are getting and get your file to Knit successfully, save both the source file (as lab4.Rmd) and the Knitted .html output file (lab4.html), which gets automatically created, to the ~/stat113/turnin/hw4/ folder.