ggplot2
stat209 folder.
.Rmd file (use Save As…) in that project folder
message = FALSE in the braces to suppress info from R, such as when loading packages
warning = FALSE if you are getting a warning that you’re convinced isn’t a problem, and you don’t want it to be displayed (but be sure first!)
echo = FALSE if you don’t want the code to show up in the output
eval = FALSE if you want the code to be displayed but not run
results = 'hide' if you want the code to be run but don’t want the results to be displayed
cache = TRUE for time-consuming chunks so they don’t re-run every time
With Markdown documents: When Knitting the entire document (unlike when running individual chunks), the process does not have access to objects in your environment: only objects defined within the document are available.
This is useful, since if you left something undefined you’ll usually get an error. But you will not always want to Knit the entire document, since this can take time; be careful when running one chunk at a time, since this can use objects in your environment.
ggplot2
Goal: By the end of this lab, you will be able to use ggplot2 to build some basic data graphics.
Before we can use a library like ggplot2, we have to load it. In this case, we load the tidyverse package, which automatically loads ggplot2 for us (since tidyverse depends on it).
Load the tidyverse package using the library() function. Adjust the chunk options to suppress messages. Knit the document.
Note: Remember, you shouldn’t copy and paste code directly from the web. Type it out yourself so that you slow yourself down a bit to process what you’re reading, and to develop your muscle memory.
Click the “Code” button on the Knitted version of this lab on the course website to see my solution, but not until after you’ve written yours!
Why ggplot2?
Advantages of ggplot2:
theme system to polish plot appearance (more on this later)
The big idea: independently specify plot building blocks and combine them to create just about any kind of graphical display you want. Building blocks of a graph include:
data (data=)
aesthetic mapping (aes())
geometric objects (geom_*())
Using ggplot2, we can specify different parts of the plot, and combine them together using the + operator.
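As a minimal sketch of how the building blocks fit together, the code below combines a data frame, an aesthetic mapping, and one geom with the + operator. The toy_housing data frame is a made-up stand-in (not the lab's housing data), and the choice of Land.Value and Structure.Cost as x and y is only an assumption for illustration.

```r
library(ggplot2)

# Hypothetical toy data; values are invented for illustration only
toy_housing <- data.frame(
  State = c("AK", "AL", "AR", "AZ"),
  Land.Value = c(64, 20, 15, 90),
  Structure.Cost = c(160, 120, 110, 150)
)

# data + aesthetic mapping + geometric object, joined with `+`
p <- ggplot(data = toy_housing,
            mapping = aes(x = Land.Value, y = Structure.Cost)) +
  geom_point()

# Printing the object draws the plot
p
```

Each piece can be swapped independently: a different geom, a different mapping, or a different data frame, without rewriting the rest.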
Let’s start by looking at some data on housing prices:
## Rows: 7,803
## Columns: 11
## $ State <chr> "AK", "AK", "AK", "AK", "AK", "AK", "AK", "AK", …
## $ region <chr> "West", "West", "West", "West", "West", "West", …
## $ Date <dbl> 2010.25, 2010.50, 2009.75, 2010.00, 2008.00, 200…
## $ Home.Value <dbl> 224952, 225511, 225820, 224994, 234590, 233714, …
## $ Structure.Cost <dbl> 160599, 160252, 163791, 161787, 155400, 157458, …
## $ Land.Value <dbl> 64352, 65259, 62029, 63207, 79190, 76256, 72906,…
## $ Land.Share..Pct. <dbl> 28.6, 28.9, 27.5, 28.1, 33.8, 32.6, 31.3, 29.9, …
## $ Home.Price.Index <dbl> 1.481, 1.484, 1.486, 1.481, 1.544, 1.538, 1.534,…
## $ Land.Price.Index <dbl> 1.552, 1.576, 1.494, 1.524, 1.885, 1.817, 1.740,…
## $ Year <dbl> 2010, 2010, 2009, 2009, 2007, 2008, 2008, 2008, …
## $ Qrtr <dbl> 1, 2, 3, 4, 4, 1, 2, 3, 4, 1, 2, 2, 3, 4, 1, 2, …
(Data originally from https://www.lincolninst.edu/subcenters/land-values/land-prices-by-state.asp, via Jordan Crouser at Smith College)
Geometric objects (geom)
Geometric objects or geoms are the actual marks we put on a plot. Examples include:
points (geom_point(), for scatter plots, dot plots, etc.)
lines (geom_line(), for time series, trend lines, etc.)
boxplots (geom_boxplot(), for, um…)
among others
A plot must have at least one geom, but you can combine multiple geoms in a single plot. Remember that you can add elements to an existing plot using the + operator (elements can be chained together in a single command, or intermediate plots can be assigned to a variable and added to later).
You can see a list of the geom_*() functions in ggplot2 using the following command:
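One way to produce such a list, offered as a sketch rather than as the lab's own command, is to search the names exported by the attached ggplot2 package. The variable name geom_functions is just an illustrative choice.

```r
library(ggplot2)

# All objects exported by ggplot2 whose names start with "geom_"
geom_functions <- ls("package:ggplot2", pattern = "^geom_")

# Show the first few names
head(geom_functions)
```

Running help.search("geom_", package = "ggplot2") is another option if you would rather browse the matching help pages.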
In RStudio, if you simply type geom_ and then press the tab key, you will see a dropdown list of possible ways to complete the text. This is a useful trick generally, to save repetitive typing. Once you have completed a function name and typed the open paren (, tab will also show you a list of valid argument names for that function.
Aesthetic mapping (aes)
In ggplot2, aesthetic means “something you can see”. Each aesthetic is a mapping between a visual cue and a variable. For example, we can map variables to cues such as position, color, fill, shape, size, and line type.
Each type of geom accepts only a subset of all aesthetics; refer to the help pages of individual geom_*() functions to see what mappings each geom accepts. Aesthetic mappings are set with the aes() function.
Now that we know about geometric objects and aesthetic mapping, we’re ready to make our first ggplot: a scatterplot. We’ll use geom_point() to do this, which requires aes mappings for x and y. Other mappings (such as color) are optional.
Example
The pipe (%>%) operator
In the filter() command above, the function filter takes the existing dataset called housing, and extracts only those cases where the entry in the Date column is equal to "2013.25", returning the result in a new dataset object, which we give the label hp2013Q1.
The function took two arguments: the dataset, and a logical condition that serves as the “filter”, only letting through cases that meet a certain criterion. It returned a dataset.
Often we will perform operations like this, which take a dataset as an argument and return a modified dataset, in sequence, “chaining” them together. One of the packages in the tidyverse augments the R language itself with an additional operator called the pipe operator (written as %>%).
Instead of writing
hp2013Q1 <- filter(housing, Date == "2013.25")
as we did above, we can instead “pipe” the data into the filter, writing the command as follows.
hp2013Q1 <- housing %>%
  filter(Date == "2013.25")
What’s happening here is the housing dataset is “fed through the pipe”, and passed on to the filter() function as its first argument. This whole expression then returns the 2013 Q1 subset of the data.
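The benefit of the pipe is easier to see once there is more than one step. The sketch below compares a nested call with the equivalent piped chain, using a small made-up data frame (toy_housing is hypothetical, and the extra arrange() step is added only for illustration).

```r
library(dplyr)

# Hypothetical toy data; values are invented for illustration only
toy_housing <- data.frame(
  State = c("AK", "AK", "AL", "AL"),
  Date = c(2013.00, 2013.25, 2013.00, 2013.25),
  Home.Value = c(250, 255, 130, 132)
)

# Nested calls: must be read from the inside out
nested <- arrange(filter(toy_housing, Date == 2013.25), Home.Value)

# Piped chain: each step receives the previous result as its first argument
piped <- toy_housing %>%
  filter(Date == 2013.25) %>%
  arrange(Home.Value)

# Both approaches give the same data frame
identical(nested, piped)
```

The piped version reads in the order the operations happen: start with the data, filter it, then sort it.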
ggplot function.
The output of the ggplot() function is an object. Since we want to modify the plot that we created above, it’s helpful to store the plot object in a named variable.
To actually show the plot, we just print it, as we would print a number or a data frame.
Notice that although the axes are set up and labeled, there’s no data being depicted. That’s because we haven’t specified any geoms. In other words, we haven’t told R what we actually want it to draw. However, the aesthetic mapping is defined, and if we take this base plot and add geoms to it, the resulting plots will use the mapping that we defined in base_plot.
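The idea of a stored base plot with a mapping but no geoms can be sketched as follows. The data frame is a made-up stand-in, and the variables mapped to x and y here are assumptions rather than the ones used in the lab's own base_plot.

```r
library(ggplot2)

# Hypothetical toy data; values are invented for illustration only
toy_housing <- data.frame(
  State = c("AK", "AL", "AR", "AZ"),
  Land.Value = c(64, 20, 15, 90),
  Structure.Cost = c(160, 120, 110, 150)
)

# Data and aesthetic mapping only: no geoms yet
base_plot <- ggplot(toy_housing, aes(x = Land.Value, y = Structure.Cost))

# Printing shows labeled axes but no marks
base_plot

# Adding a geom reuses the mapping stored in base_plot
points_plot <- base_plot + geom_point()
points_plot
```

Because base_plot is an ordinary R object, it can be reused as the starting point for several different plots.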
Let’s add some points!
home_value_plot.
A plot constructed with ggplot can have more than one geom. For example, we could connect all of the points using geom_line(). By default, the aesthetic mapping defined in the base plot is carried over to any new geoms that we add. Note that now we see both points and lines.
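Here is a small sketch of two geoms sharing one mapping. The data frame and the choice of Date and Home.Value as x and y are illustrative assumptions, not the lab's actual plot.

```r
library(ggplot2)

# Hypothetical toy data; values are invented for illustration only
toy_housing <- data.frame(
  Date = c(2012.25, 2012.50, 2012.75, 2013.00),
  Home.Value = c(220, 224, 223, 229)
)

# Neither geom has its own aes(): both inherit x and y from ggplot()
two_geoms <- ggplot(toy_housing, aes(x = Date, y = Home.Value)) +
  geom_point() +
  geom_line()

two_geoms
```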
Does it make sense to use geom_line() in this case? Do the lines help us understand the connections between the observations? What do the lines represent? (This one’s text, not code.)
Not all geometric objects are simple shapes. geom_smooth() includes both a line and a “ribbon”, where the line is a “smoothed” moving average of the y variable, and the ribbon is a 95% confidence band, representing our uncertainty about what the moving average actually would be if we had infinite data.
Other smoothing methods and band definitions are available too. You can find out more about the various options by looking at the documentation page for geom_smooth().
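As a sketch of what "other smoothing methods and band definitions" can look like, the code below uses the method, level, and se arguments of geom_smooth(). The data are generated from a made-up formula purely for illustration.

```r
library(ggplot2)

# Hypothetical toy data from a deterministic made-up formula
x <- 1:20
toy_trend <- data.frame(x = x, y = 2 + 0.5 * x + sin(x))

# Straight-line fit with a 90% confidence band instead of the default 95%
lm_plot <- ggplot(toy_trend, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, level = 0.90)
lm_plot

# Local regression (loess) with the ribbon turned off
loess_plot <- ggplot(toy_trend, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE)
loess_plot
```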
Each geom accepts a particular set of aesthetics (i.e., mappings). For example, geom_text() accepts a label mapping. This mapping wasn’t defined in the base plot, so we can add it here.
Note that in the following plot we are not using geom_point() or geom_line(). We have only geom_text(), since we only want the state labels to be drawn, not points or lines.
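A minimal sketch of a text-only plot follows. The data frame and the x and y variables are made-up assumptions; the key part is the label mapping supplied inside geom_text().

```r
library(ggplot2)

# Hypothetical toy data; values are invented for illustration only
toy_housing <- data.frame(
  State = c("AK", "AL", "AR", "AZ"),
  Land.Value = c(64, 20, 15, 90),
  Structure.Cost = c(160, 120, 110, 150)
)

base_plot <- ggplot(toy_housing, aes(x = Land.Value, y = Structure.Cost))

# The label aesthetic is added for this geom only
label_plot <- base_plot + geom_text(aes(label = State))
label_plot
```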
Note that the variables are mapped to aesthetics with the aes() function, while stylization that doesn’t express an aspect of the data is set outside the aes() call. This sometimes leads to confusion, as in this example:
base_plot +
  geom_point(
    aes(size = 2),   # not what you want, because 2 is not a variable
    color = 'red')   # this just turns all points red
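For contrast, here is a sketch of the two intended forms: setting a constant outside aes(), and mapping a variable inside it. The data frame is a made-up stand-in and the variable choices are assumptions.

```r
library(ggplot2)

# Hypothetical toy data; values are invented for illustration only
toy_housing <- data.frame(
  Land.Value = c(64, 20, 15, 90),
  Structure.Cost = c(160, 120, 110, 150),
  Home.Value = c(225, 140, 125, 240)
)

base_plot <- ggplot(toy_housing, aes(x = Land.Value, y = Structure.Cost))

# Setting: constants go outside aes(), so every point is size 2 and red
set_plot <- base_plot + geom_point(size = 2, color = "red")
set_plot

# Mapping: a variable goes inside aes(), so size varies with the data
map_plot <- base_plot + geom_point(aes(size = Home.Value), color = "red")
map_plot
```

A quick check is to ask whether the value comes from a column of the data. If it does, it belongs inside aes(); if it is a fixed choice, it belongs outside.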
The aes() function can also be used outside of a call to a geom. Here we update the base_plot with an additional mapping, assigning the color cue to the home value variable.
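Updating a stored plot with an extra mapping might look like the sketch below. The data frame is made up, and the x and y variables in base_plot are assumptions; only the pattern of adding aes() with + is the point.

```r
library(ggplot2)

# Hypothetical toy data; values are invented for illustration only
toy_housing <- data.frame(
  Land.Value = c(64, 20, 15, 90),
  Structure.Cost = c(160, 120, 110, 150),
  Home.Value = c(225, 140, 125, 240)
)

base_plot <- ggplot(toy_housing, aes(x = Land.Value, y = Structure.Cost))

# Adding aes() to the plot extends the plot-level mapping,
# so every geom added afterwards inherits the color mapping
colored_plot <- base_plot +
  aes(color = Home.Value) +
  geom_point()
colored_plot
```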
Starting from home_value_plot, map color to the cost of the structure, and show your scatterplot.
Other aesthetics are mapped in the same way as x and y in the previous example.
Aesthetic mapping (i.e., with aes()) only says that a variable should be mapped to an aesthetic. It doesn’t say how that should happen. For example, when mapping a variable (say, z) to shape with aes(shape = z), you don’t say what shapes should be used. Similarly, aes(color = z) doesn’t say what colors should be used. Describing what colors/shapes/sizes, etc. to use is done by modifying the corresponding scale. In ggplot2, scales include:
position
color, fill, and alpha (these control the “outer” and “inner” colors and the opacity (from transparent at 0 to opaque at 1), respectively, of the geometric objects)
size
shape
linetype
Scales are modified with a series of functions using a scale_<aesthetic>_<type> naming template. Try typing scale_ followed by the tab key to see a list of scale modification functions.
The following arguments are common to most scales in ggplot2:
name: the first argument specifies the axis or legend title
limits: the minimum and maximum of the scale
breaks: the points along the scale where labels should appear
labels: the text that appears at each break
Specific scale functions may have additional arguments; for example, the scale_color_continuous() function has arguments low and high for setting the colors at the low and high end of the scale.
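The four common arguments can be seen together in a short sketch using scale_x_continuous(). The data frame, axis title, limits, and break positions are all made-up choices for illustration.

```r
library(ggplot2)

# Hypothetical toy data; values are invented for illustration only
toy_housing <- data.frame(
  Date = c(2012.25, 2012.75, 2013.25, 2013.75),
  Home.Value = c(220, 224, 223, 229)
)

scaled_plot <- ggplot(toy_housing, aes(x = Date, y = Home.Value)) +
  geom_point() +
  scale_x_continuous(
    name = "Quarter",                   # axis title
    limits = c(2012, 2014),             # minimum and maximum of the scale
    breaks = c(2012, 2013, 2014),       # where tick labels appear
    labels = c("2012", "2013", "2014")  # text shown at each break
  )
scaled_plot
```

The same argument names apply to legend-producing scales such as color, where name sets the legend title instead of an axis title.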
Let’s start by constructing a dot plot to show the distribution of the home price index by Date and State.
base_home_plot <- housing %>%
  ggplot(aes(y = State, x = Home.Price.Index)) +
  geom_point(
    aes(color = Date),
    alpha = 0.3,
    size = 1.5,
    position = position_jitter(width = 0, height = 0.25))
First, let’s change the label on the vertical axis.
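One way to relabel the vertical axis is through the name argument of the matching scale function. This is a sketch on made-up data; the axis title text and the object name labeled_toy_plot are illustrative assumptions, and the lab's own definition of labeled_home_plot may differ.

```r
library(ggplot2)

# Hypothetical toy data; values are invented for illustration only
toy_housing <- data.frame(
  State = c("AK", "AK", "AL", "AL"),
  Date = c(2012.25, 2013.25, 2012.25, 2013.25),
  Home.Price.Index = c(1.48, 1.54, 1.10, 1.15)
)

base_toy_plot <- ggplot(toy_housing, aes(y = State, x = Home.Price.Index)) +
  geom_point(aes(color = Date))

# State is categorical, so the y axis uses a discrete scale
labeled_toy_plot <- base_toy_plot +
  scale_y_discrete(name = "State abbreviation")
labeled_toy_plot
```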
Now let’s modify the breaks and labels for the color scale:
labeled_home_plot +
  scale_color_continuous(breaks = c(1975.25, 1994.25, 2013.25),
                         labels = c(1975, 1994, 2013))
Now let’s change the low and high values to blue and red for a plot that’s a bit more dramatic:
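A sketch of a blue-to-red color scale is shown below on made-up data. It uses scale_color_gradient(), the two-color gradient scale that takes low and high directly; this is a substitute chosen for the example, not necessarily the function the lab's own code calls.

```r
library(ggplot2)

# Hypothetical toy data; values are invented for illustration only
toy_housing <- data.frame(
  State = c("AK", "AK", "AK", "AL", "AL", "AL"),
  Date = c(1975.25, 1994.25, 2013.25, 1975.25, 1994.25, 2013.25),
  Home.Price.Index = c(0.60, 1.00, 1.50, 0.50, 0.90, 1.20)
)

gradient_plot <- ggplot(toy_housing, aes(y = State, x = Home.Price.Index)) +
  geom_point(aes(color = Date), size = 3) +
  scale_color_gradient(
    breaks = c(1975.25, 1994.25, 2013.25),
    labels = c(1975, 1994, 2013),
    low = "blue",   # color at the earliest dates
    high = "red"    # color at the latest dates
  )
gradient_plot
```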
ggplot2 has a wide variety of color scales; here is an example using scale_color_gradient2() to interpolate between three different colors:
labeled_home_plot +
  scale_color_gradient2(
    breaks = c(1975.25, 1994.25, 2013.25),
    labels = c(1975, 1994, 2013),
    low = "blue", high = "red", mid = "gray60",
    midpoint = 1994.25)
Use geom_vline() to add a dotted, black, vertical line to the plot we created above (use the Help page for geom_vline() to figure out its syntax).
Plot elements in ggplot2 are added sequentially. See if you can figure out how to put the dotted vertical line you created in the previous exercise behind the data values.
Note: In RStudio, you can type scale_ followed by TAB to get the whole list of available scales.
The idea behind faceting is to create separate graphs for subsets of the data, and tile those graphs in a manner that makes it easy to visually compare them.
ggplot2 offers two functions for creating facets:
facet_wrap(): define subsets as the levels of a single grouping variable, and tile the resulting plots in one dimension, “wrapping around” as needed.
facet_grid(): define subsets as the crossing of two grouping variables
By splitting data into separate subplots it is possible to keep the amount of clutter in a single plot under control, while keeping all the information in easy visual proximity to facilitate comparison among plots.
Let’s start by using a technique we already know: map State to color:
state_plot <- housing %>%
  ggplot(aes(x = Date, y = Home.Value))
state_plot +
  geom_line(aes(color = State))
This plot is horrendous. There are two problems: the distinctions among colors are too fine-grained to be able to see them, and the lines obscure each other.
We can fix the previous plot by faceting by State rather than mapping State to color:
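Faceting by a single variable can be sketched as follows. The data frame is a made-up stand-in with three states, so the result has three panels rather than one per state in the real housing data.

```r
library(ggplot2)

# Hypothetical toy data; values are invented for illustration only
toy_housing <- data.frame(
  State = rep(c("AK", "AL", "AR"), each = 3),
  Date = rep(c(2012.25, 2012.75, 2013.25), times = 3),
  Home.Value = c(220, 224, 229, 130, 131, 135, 110, 112, 111)
)

# One panel per level of State, instead of one color per State
facet_plot <- ggplot(toy_housing, aes(x = Date, y = Home.Value)) +
  geom_line() +
  facet_wrap(~ State)
facet_plot
```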
Notice the tilde (~) syntax before the variable name. This is a convention borrowed from the syntax R uses to define regression models, and which the lattice graphics package (as well as the mosaic package) uses to define plots.
The facet_grid() function can be used to create facets that vary according to two grouping variables. Its syntax is facet_grid(y ~ x), where y and x are the names of grouping variables that define the rows (vertical) and columns (horizontal) of the faceted grid, respectively.
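A two-variable grid might look like the sketch below, with region defining the rows and Year defining the columns. The data frame is made up for illustration, and the choice of these two grouping variables is an assumption.

```r
library(ggplot2)

# Hypothetical toy data; values are invented for illustration only
toy_housing <- data.frame(
  region = rep(c("West", "South"), each = 4),
  Year = rep(c(2012, 2013), times = 4),
  Qrtr = rep(c(1, 1, 2, 2), times = 2),
  Home.Value = c(220, 229, 224, 231, 130, 135, 131, 137)
)

# Rows come from the variable left of ~, columns from the variable on the right
grid_plot <- ggplot(toy_housing, aes(x = Qrtr, y = Home.Value)) +
  geom_point() +
  facet_grid(region ~ Year)
grid_plot
```

With two regions and two years, this yields a 2-by-2 grid of panels.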
Use facet_wrap and/or facet_grid to create a data graphic of your choice that illustrates something interesting about home prices. Add some text below the graph describing what it reveals.
Copy your .Rmd and Knitted .html files to the path ~/stat209/turnin/lab3 on the RStudio server.
This lab is based on the “Introduction to R Graphics with ggplot2” workshop, which is a product of the Data Science Services team at Harvard University. The original source is released under a Creative Commons Attribution-ShareAlike 4.0 Unported license. This lab was adapted for SDS192: Introduction to Data Science in Spring 2017 by R. Jordan Crouser at Smith College, and further adapted for STAT209: Data Computing and Visualization by Colin Dawson at Oberlin College.