This should help you slow down, read carefully, pay attention to detail, and make sure you understand what you’re reading and writing.
lab2
and make a directory for it inside your stat113
folder.Last week we stored our commands in a file called a script, which was just a text file with a command (or a comment) on each line. This week we will use a format called RMarkdown that helps us keep our work better organized. It is basically like a Word document with some simple plain text syntax to control formatting, with mini R scripts interspersed, each followed by whatever output it produces (it is also what I use to make these lab documents).
RMarkdown-template.Rmd
in your stat113
folder.lab2.Rmd
in your project directory..Rmd
file to get a basic sense of how RMarkdown works..Rmd
file as instructed in the template. This will help you keep things organized.Don’t move on until you have read through the .Rmd
template and understand where and how to add your work to it!
As you work through the commands listed in this packet, you should type the
commands in your `.Rmd` file. Make a new code chunk for each chunk of code
as it appears here, and run that chunk using the "Run Chunk" button at the
upper right of the shaded region.
Any text that you write that isn't code can go in your .Rmd file *not* in
a code chunk. Unlike in an R Script, you do not need to precede verbal
commentary with the # symbol, as long as it is not within a chunk.
EmployedACS
dataset from the Lock5Data
package, including the description of the values of each variable. Which ones are quantitative? Which are categorical? Which are binary?In this section we will look at how to create some basic visualizations and get some basic summary statistics in R.
We are going to be using functions from the following packages, so if you haven’t already, load them at the start of your script (pay attention to capitalization):
mosaic
ggformula
Lock5Data
library(mosaic) # Needed for the various functions
library(ggformula) # Needed for some plotting functions
library(Lock5Data) # Needed for the EmployedACS data
Now, load the EmployedACS
dataset from the Lock5Data
package. Since the package is already loaded, we can go straight to the data()
line.
R Tip:
As a rule, any time you start a new RMarkdown file, you should load any
libraries you need in a chunk near the top of the file. The template
I provided loads mosaic and ggformula for you already. Don't forget
to load any datasets you need, using the method appropriate to the
source of the data, (data() for datasets from packages,
read.file() for data from files).
For a quantitative variable like HoursWk
(Hours worked per week), we can get a general sense of what values are in the data using a dot plot or a histogram.
## The gf_ prefix indicates that we are using a plot function
## from the ggformula package, which gives us a consistent template across
## different plot types.
## Unfortunately with this particular plotting function, the y axis does not
## show the counts.
gf_dotplot(~HoursWk, data = EmployedACS)
You should see the plot appear in the lower-right “Plots” pane. Dots represent cases, and they are stacked on an axis representing hours worked. It is the same idea as a histogram, but depicting individual data points.
Let’s examine the structure of the command you just typed. The first argument here (with the \(\sim\) in front) is a formula, with the name of the variable we are interested in plotting appearing on the right hand side of the tilde (\(\sim\)) symbol. We will see this format appear over and over again.
We will also see the data=
argument appear over and over again. This indicates what data frame the variable we are interested in is in.
Likely the stacks of dots are rounded off at weird intervals, and the dots are an odd size. We can modify the gf_dotplot()
command to control the rounding that gets done and the size of the dots with the binwidth=
and dotsize=
arguments, respectively.
This says to group the dots in bins spanning a range of 5 hours, and to magnify the default dot size by a factor of 3. Alternatively, we could use nint=
instead of width=
to control the number of bins instead of the width of each bin.
Alternatively, we could create a histogram:
You should again see a horizontal axis representing number of hours worked, but now instead of individual dots, there are bars. The height of each bar represents the number of cases in a particular range.
Note that the command has similar structure as dotPlot()
: we have a “formula” giving the variable name we are plotting, a data=
argument specifying the dataset that variable is in. Some of the optional arguments have different names, and we also have a type=
argument to control whether we’re plotting counts or density on the \(y\)-axis.
You can make your plot (possibly) prettier by adding a bit of color. In this case, we use the fill=
argument to control the color that fills in the bars.
You can type colors()
at the console to see a (long) list of available colors.
Alternatively you can use a predefined color palette and then refer to colors by number. For example, try this:
library("RColorBrewer") # a package with some decent palettes
display.brewer.all() # Shows names of some nice color palettes
## Sets the palette to the first 8 colors in "Set1"
## Type ?brewer.pal to see other palette options
palette(brewer.pal(n = 8, name = "Set1"))
Note:
When you run the above chunk, the palette image will be squashed vertically.
To be able to read the palette names, you can "pop out" the image with the
"Show in New Window" button at the upper right of the image. To change
the dimensions of the image in your Knitted document, you can modify the
chunk options with the "Gear" button. Choose "Use custom figure size",
and experiment with measurements until you get a size that looks good.
Now if we give colors by number it will use the palette (note, however, that if we pick a value for n
which is less than the number of colors in the palette, it may not choose the first n
colors; instead, it will choose n
colors that are roughly equally spaced across the palette range).
Income
variable in this dataset. Describe the shape of the distribution and estimate the mean and median income of the individuals in this data by eyeballing the plots (you will likely want to examine the documentation for the dataset to see how income is measured).We can also create a box-and-whisker plot for a quantitative variable, which depicts the median, the first quartile (the value that has 25% of the cases below it, in other words the 25th percentile, or the “median of the lower half of the cases”), and the third quartile (the value with 3/4 or 75% of the data below it, which is the median of the upper half of cases).
The “whiskers” extend out from the box, to a width at most 1.5 times the width of the box. Points beyond the whiskers are depicted individually, identified as potential outliers
We’ll talk about all this more formally in class tomorrow.
## By default gf_boxplot() makes vertical boxplots. But to be consistent with
## the orientation of our histograms, we'd prefer horizontal. The following
## code modifies the plot by flipping the coordinates.
## The %>% is called a "pipe": it sends the output of one command into
## the input of the next command.
gf_boxplot(~Income, data = EmployedACS) %>%
gf_refine(coord_flip())
Note that what is now the \(y\) axis here has meaningless tick marks. Set that aside for now.
Income
variable from the box-and-whisker plot.We can get a set of useful summary statistics for our HoursWk
variable using the favstats()
function. The syntax is the same as for gf_dotplot()
and gf_histogram()
, in that we have a formula and a data=
argument.
Income
variable. Were the estimates you made above by eyeballing the plots reasonably close?We can get individual summary statistics using the same format. For example,
mean(~HoursWk, data = EmployedACS)
median(~HoursWk, data = EmployedACS)
sd(~HoursWk, data = EmployedACS) # Standard Deviation
var(~HoursWk, data = EmployedACS) # Variance
If you don’t know what standard deviation and variance are, don’t worry; we will talk about them tomorrow too.
Mean, median, etc. do not make sense for a categorical variable, since we cannot add the values together, and they are not ordered.
Instead, we can summarize the variable with a frequency table:
Or, we can get a relative frequency table by specifying format = "proportion"
2 + 2
).We can get a contingency table showing combinations of levels of two categorical variables (for example, Sex and Race) as follows
If we want to condition on (group by) one of the variables, and see what proportion of each group is in each category of the other variable, we can do this as follows:
## The vertical bar is the symbol for "grouped by" or "conditioned on".
## The variable to the right is the one defining the groups.
tally(~Sex | Race, data = EmployedACS, format = "proportion")
What contrasts jump out at you, looking at this table?
Construct a table of conditional proportions to help you answer the following question: How does the proportion of workers without health insurance differ between U.S. citizens and non-U.S. citizens in this data?
We can plot counts or proportions using a bar graph (and definitely not a pie chart…)
## Counts version
gf_bar(~Race, data = EmployedACS)
## Proportions version
gf_props(~Race, data = EmployedACS)
## Percentages version
gf_percents(~Race, data = EmployedACS)
Or, with fill colors determined by the Race variable:
(Note that to have fill color depend on a variable, we use formula syntax here, rather than providing a single fixed color name in quotes)
Modify your graph by adding a label for the \(x\)-axis by adding an xlab=
argument to the plot function (you might need to refer back to Lab 1 to remind yourself how arguments work). Put the desired label in quotes as the argument value so the reader knows what the bars represent.
If we have a contingency table with two categorical variables, we can visualize it in a couple of ways. One is to use a grouped bar graph:
As we’ve seen, we often want to know about proportions of the categories from one variable conditioned on (grouped by) the categories of another. We might similarly be interested in the distribution of a quantitative variable in each of several subsets of the data, where the subsets are defined by another categorical variable.
For example, we might be interested in the distribution of the number of hours worked, separating the data into married and not married individuals. We might want to look at this using boxplots. It is possible to use the same syntax we used above:
but it’s hard to compare the two plots side by side like this. The following syntax produces a similar plot but grouped in a more convenient way.
## We use factor() here to tell R that Married is categorical. Since it (like Sex) is
## coded as 0 or 1 in the data, R does not know this automatically.
gf_boxplot(HoursWk ~ factor(Married), data = EmployedACS) %>%
gf_refine(coord_flip())
HoursWk
variable you created above)We can use the same command structure to find means, medians, etc. split by group.
Compute the mean income for workers with and without health insurance.
Produce side-by-side boxplots with a single shared axis showing the distribution of income, grouped by whether the person has health insurance (you will want to use factor()
to tell R to treat the grouping variable as categorical).
Suppose a new person (not in this dataset) filled out the survey and reported income of $80K but neglected to indicate whether they have health insurance. Based on the side-by-side boxplots, would you guess that they have health insurance or not? How confident would you be? Explain.
When you’re finished, Knit your finished .Rmd
file. Once you have corrected any errors you are getting and get your file to Knit successfully, save both the source file (as lab2.Rmd
) and the Knitted .html
output file (lab2.html
) that gets automatically created to the ~/stat113/turnin/lab2/
folder. Only one file per partnership is needed if you completed the whole lab together. If you run out of time in class and are unable to meet to finish together, the typist can send a copy of this file to their partner, and then both can continue and each submit a file which is partly identical.
If you forget how to download a file from the server so that you can send it to your partner, see Lab 1 (section “Files, Plots, Packages, Help Viewer”) for a reminder.