lab4
and make a directory for it wherever you have been putting project folders.lab4.Rmd
in your project folder. Make your edits there.As you work through the lab, you should run the example chunks as you go. Use the “Run Chunk” button at the top right of the chunk.
For numbered exercises, put your text and code between the — markers and below the ##### lines. Leave a blank line before and after the ``` lines starting and ending code chunk as well as before and after the — markers. Like this:
Here’s what I think about this!
in here code
Here’s another thing I think about that!
To create a new code chunk you can either copy and paste an existing chunk, type the delimiter lines directly, or use the “Insert Chunk > R” button at the top of the editor pane (the green square with a white C). There is also a keyboard shortcut to insert a new chunk.
EmployedACS
dataset from the Lock5Data
package. You will first need to load the package with library(PackageName)
,then the dataset, with data(DatasetName)
, then finally, pull up the code book with help(DatasetName)
. The library()
and data()
commands should be in a code chunk, but the help()
command is best typed at the console so it doesn’t run when you Knit. You can refer back to Lab3 for reference if needed. Which variables are quantitative? Which are categorical?In this section we will look at how to create some basic visualizations and get some basic summary statistics in R.
We are going to be using functions from the following packages. Make sure they are loaded via the library()
command. You only need to load each package once per document (but it doesn’t hurt anything to do it again).
mosaic
tidyverse
Lock5Data
## mosaic and should be loaded in the setup chunk above already. Make sure you run it though!
## Use the "Run All Chunks Above" button at the top right of this chunk
library(Lock5Data)
Having loaded the package, we can load the dataset with data()
(you should have done this in Exercise 1, but here’s the code just in case)
data(EmployedACS) # Loading the dataset
For a quantitative variable like HoursWk
(Hours worked per week), we can get a general sense of what values are in the data using a histogram. Below is a command that produces a histogram. You should see the plot appear below the chunk.
## Pay attention to capitalization. Pretty much everything in R is case-sensitive.
gf_histogram(~HoursWk, data = EmployedACS)
Notice how the command above is structured: the function name is gf_histogram()
. Many plotting functions will start with gf_
, which is shorthand for the name of a convenient plotting package called ggformula
. The function name describes what we are getting.
Then, the first argument has the name of the variable we are graphing, preceded by a tilde (~
).
Finally, we have the named data=
argument, which is set to the name of the dataset. We will see this same pattern over and over for different objectives (plots, summary statistics, etc.).
Here’s a general template for many of the commands we will use (this is not actually runnable code):
goal(~VariableName, data = DataSetName)
We can modify the gf_histogram()
command to control how many “bins” we get, using the optional bins=
argument (set it to the desired number of bins). Alternatively we can set binwidth=
to a desired width.
gf_histogram(~HoursWk, data = EmployedACS, bins = 40)
or
gf_histogram(~HoursWk, data = EmployedACS, binwidth = 1)
You can make your plot (possibly) prettier by adding a bit of color. In this case, we use the fill=
argument to control the color that fills in the bars.
gf_histogram(~HoursWk, data = EmployedACS, bins = 40, fill = "green")
You can type colors()
at the console to see a (long) list of available color names.
Income
variable in this same dataset. Describe the shape of the distribution and estimate the mean and median income of the individuals in this data by eyeballing the plots (you will likely want to examine the documentation page for the dataset, using help()
or the ?
syntax to see how Income
is measured).We can also create a box-and-whisker plot for a quantitative variable.
gf_boxplot(~Income, data = EmployedACS)
Notice that the pattern of this command is the same as the previous ones!
Income
variable from the box-and-whisker plot.We can get a set of useful summary statistics for the HoursWk
variable using the favstats()
function. The syntax is the same as for gf_histogram()
, and gf_boxplot()
, in that we have a formula (containing a variable name preceded by ~
) and a data=
argument (containing the name of the dataset).
favstats(~HoursWk, data = EmployedACS)
## min Q1 median Q3 max mean sd n missing
## 2 35 40 42 88 37.86713 12.94576 1287 0
Income
variable. Were the estimates you made above by eyeballing the plots reasonably close?We can get individual summary statistics using the same format. For example,
mean(~HoursWk, data = EmployedACS)
## [1] 37.86713
median(~HoursWk, data = EmployedACS)
## [1] 40
sd(~HoursWk, data = EmployedACS) # Standard Deviation
## [1] 12.94576
IQR(~HoursWk, data = EmployedACS)
## [1] 7
We can summarize a categorical variable such as Race
using a frequency table. The tally()
function will do the counting for us and create the table.
tally(~Race, data = EmployedACS)
## Race
## asian black other white
## 92 116 102 977
We can get a relative frequency table instead by specifying format = "proportion"
for proportions on a 0 to 1 scale, or or format = "percent"
for percentages on a 0 to 100 scale.
tally(~Race, data = EmployedACS, format = "percent")
## Race
## asian black other white
## 7.148407 9.013209 7.925408 75.912976
pctPoC
.We can get a contingency table showing combinations of levels of two categorical variables (for example, HealthInsurance
and Race
) as follows:
tally(Race ~ HealthInsurance, data = EmployedACS, format = "count")
## HealthInsurance
## Race 0 1
## asian 6 86
## black 10 106
## other 22 80
## white 70 907
The HealthInsurance
variable is coded as 0 or 1. Refer to the code book to interpret these.
With two variables, we will often see formulas that look like this, with one variable name to the left of the tilde (~
), which will generally be the “y” variable, and the other one to the right, which is like the “x” variable.
If we want to condition on (group by) one of the variables, and see what proportion of each group is in each category of the other variable — for example, to look at what percentage of respondents of each race are male and what percentage of respondents of each race are female, we can do this as follows:
## The vertical bar is the symbol for "grouped by" or "conditioned on".
## The variable to the right is the one defining the groups.
tally(~HealthInsurance | Race, data = EmployedACS, format = "percent")
## Race
## HealthInsurance asian black other white
## 0 6.521739 8.620690 21.568627 7.164790
## 1 93.478261 91.379310 78.431373 92.835210
Notice that the percentages add up to 100 within each level of the grouping variable, but not within levels of the other variable.
We can plot counts or proportions using a bar graph (and definitely not a pie chart…)
## Counts version
gf_bar(~Race, data = EmployedACS)
## Proportions version
gf_props(~Race, data = EmployedACS)
## Percentages version
gf_percents(~Race, data = EmployedACS)
Or, with fill colors determined by the Race variable (note that having the color vary by the value of a variable uses the formula syntax too):
gf_percents(~Race, data = EmployedACS, fill = ~Race)
We can modify the graph in two ways:
HealthInsurance
is categorical using the factor()
function.xlab=
argument to the gf_bar()
function. Put the desired label in quotes as the value of the xlab=
argument so the reader knows what the bars represent:gf_props(
~factor(HealthInsurance),
data = EmployedACS,
xlab = "Health Insurance Status (1 = has insurance; 0 = does not have insurance)")
If we have a contingency table with two categorical variables, we can visualize it in a couple of ways. One is to use a grouped bar graph:
## The vertical bar means "conditioned on" or "grouped by".
## Unfortunately since HealthInsurance is encoded as 0 or 1, this is what gets displayed.
## This can be fixed, but it's not super simple, so we'll skip it for now
gf_percents(
~factor(HealthInsurance) | Race,
data = EmployedACS,
fill = ~factor(HealthInsurance))
It might be easier to see what’s going on in this graph if we use stacked bars. Getting stacks with the conditional percentages we want is a bit unintuitive, but here’s how to do it. While we’re at it I’ll fix the legend. Don’t feel you need to know how to do the tweaks below, but in case you want to, it’s here for reference.
gf_percents(
~Race, ## defines the 'x' variable
fill = ~factor(HealthInsurance), ## defines the variable for color
denom = ~x, ## sets the x variable totals to be the denominators
position = "stack",
data = EmployedACS) + ## The + allows us to add another command to tweak the key
scale_fill_discrete(
name = "Health Insurance Status",
labels = c(`0` = "Uninsured", `1` = "Insured")
)
As we’ve seen, we often want to know about proportions of the categories from one variable conditioned on (grouped by) the categories of another. We might similarly be interested in the distribution of a quantitative variable in each of several subsets of the data, where the subsets are defined by another categorical variable.
For example, we might be interested in the distribution of the number of hours worked, separating the data into married and not married individuals. We might want to look at this using boxplots. It is possible to use the same syntax we used above:
gf_boxplot(~HoursWk | Married, data = EmployedACS)
but it’s hard to compare the two plots side by side like this. The following syntax produces a similar plot but grouped in a more convenient way.
## factor() here tells R that Married is categorical. Since it (like Sex) is
## coded as 0 or 1 in the data, R does not know this automatically.
gf_boxplot(HoursWk ~ factor(Married), data = EmployedACS)
By default, the variable before the ~
goes on the \(y\)-axis and the variable after the tilde goes on the \(x\) axis. If we wanted horizontal boxplots like we’re more used to, we could swap their positions.
gf_boxplot(factor(Married) ~ HoursWk, data = EmployedACS)
HoursWk
variable you created above)We can use the analogous command structure to find means, medians, etc., split by group.
favstats(~HoursWk | Married, data = EmployedACS)
## Married min Q1 median Q3 max mean sd n missing
## 1 0 2 25 40 40 88 35.35972 14.04229 581 0
## 2 1 3 40 40 45 80 39.93059 11.57753 706 0
mean(~HoursWk | Married, data = EmployedACS)
## 0 1
## 35.35972 39.93059
factor()
to tell R to treat the grouping variable as categorical)..Rmd
file to .html
. Once you have corrected any errors you are getting and get your file to Knit successfully, save both the source file (as lab4.Rmd
) and the Knitted .html
output file (lab4.html
), which gets automatically created, to the ~/stat113/turnin/hw4/
folder.