Take a published data visualization and recreate it step-by-step in R with ggplot2
. This lab is based on section 3.3 of the textbook Modern Data Science with R.
http://github.com/ocstats/stat209-s2021-lab5-crdawson
but with your account name at the end), and save this file in it. Stage, commit, and push the changes.The visualization we will recreate comes from this FiveThirtyEight post about the historical popularity of various baby names. We will focus on the second plot, depicting how many “Josephs” were born in each year, and how many of those are estimated to be alive today. For convenience, I’ve embedded the plot in the lab below.
The Social Security Administration (SSA) provides historical data on how many babies were born in each year with various names. The relevant data is provided in the lifetables
data frame from the babynames
package. Let’s load it now.
Code
To make life easier, the mdsr
package provides a function called make_baby_names_dist()
that does some useful preprocessing of the data so that we don’t have to deal with any data-wrangling yet. It takes no arguments and returns a modified data frame.
Code
## Rows: 1,639,722
## Columns: 9
## $ year <dbl> 1900, 1900, 1900, 1900, 1900, 1900, 1900, 1900, 1…
## $ sex <chr> "F", "F", "F", "F", "F", "F", "F", "F", "F", "F",…
## $ name <chr> "Mary", "Helen", "Anna", "Margaret", "Ruth", "Eli…
## $ n <int> 16706, 6343, 6114, 5304, 4765, 4096, 3920, 3896, …
## $ prop <dbl> 0.05257559, 0.01996211, 0.01924142, 0.01669226, 0…
## $ alive_prob <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ count_thousands <dbl> 16.706, 6.343, 6.114, 5.304, 4.765, 4.096, 3.920,…
## $ age_today <dbl> 114, 114, 114, 114, 114, 114, 114, 114, 114, 114,…
## $ est_alive_today <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
The plot only involves male babies named “Joseph”, so let’s create a filtered dataset with just those births.
Code:
## Rows: 111
## Columns: 9
## $ year <dbl> 1900, 1901, 1902, 1903, 1904, 1905, 1906, 1907, 1…
## $ sex <chr> "M", "M", "M", "M", "M", "M", "M", "M", "M", "M",…
## $ name <chr> "Joseph", "Joseph", "Joseph", "Joseph", "Joseph",…
## $ n <int> 3714, 2766, 3098, 3121, 3291, 3302, 3527, 3844, 4…
## $ prop <dbl> 0.02290712, 0.02392837, 0.02333710, 0.02413281, 0…
## $ alive_prob <dbl> 0.000000, 0.000025, 0.000050, 0.000075, 0.000100,…
## $ count_thousands <dbl> 3.714, 2.766, 3.098, 3.121, 3.291, 3.302, 3.527, …
## $ age_today <dbl> 114, 113, 112, 111, 110, 109, 108, 107, 106, 105,…
## $ est_alive_today <dbl> 0.000000, 0.069150, 0.154900, 0.234075, 0.329100,…
There are two geoms
in this plot: a line depicting the total number of Josephs born in each year, and a set of bars depicting the number estimated to be alive today. In both cases, the x-axis depicts the year; however, there are two different variables mapped to the y-axis. So we don’t want to set up a global mapping to the y-axis; instead we will define the y mapping at the level of each geom. At the global level, we’ll just define the x mapping.
Code
geom
sThe bar graph component of the plot is based on the estimated number of male “Joseph”s alive today from each birth year. This is approximately the number of male Josephs born that year (count_thousands
) times the probability that someone born that year is still alive now (alive_prob
). The product of these two variables has already been computed for us as est_alive_today
.
Let’s try adding a geom_bar()
with the estimated number alive today (est_alive_today
) as the height of the bars.
Attempted Code:
Oops, we get an error message.
The reason this gives an error is that the default for a bar graph is to count how often each level of the “x” variable occurs, and plot that count on the “y” axis. This is what we would want to do if each case in our data was an individual Joseph. In this type of plot, we can’t also map a variable to the y dimension; hence the error.
In our data, each case is a year, and we have the count precomputed in the est_alive_today
variable.
To tell geom_bar
to use an existing variable instead of counting cases, we can specify stat = "identity"
.
Code:
Hmm… This shows the data we want, but it’s a little cluttered to have such big numbers on the y axis. The original plot shows thousands on the y axis; let’s do the same.
Code:
That looks cleaner. Let’s try to match the colors of the original bars. The light blue fill is roughly color #b2d7e9
, and for the white outlines, we’ll just set color="white"
.
Code:
joseph_base_plot +
geom_bar(
stat = "identity",
aes(y = est_alive_today / 1000),
fill = "#b2d7e9", # fill controls the interior of the bars
color = "white" # color controls the outline of the bars
)
Ok, we should be done tinkering with the bar component, so let’s save our updated plot.
Code:
joseph_bar_plot <- joseph_base_plot +
geom_bar(
stat = "identity",
aes(y = est_alive_today / 1000),
fill = "#b2d7e9",
color = "white"
)
Now let’s add the line. This should depict the total number of births in thousands, which is calculated already in the count_thousands
variable.
Code:
We can provide more informative axis labels with the xlab()
and ylab()
functions. These are quick wrapper functions that let us easily change just the labels (for more fine-grained control over the axes, including things like the tick marks, we could use scale_x_continuous()
or scale_y_continuous()
).
Code:
joseph_labeled_plot <- joseph_bar_line_plot +
xlab("Birth Year") +
ylab("Number of People (Thousands)")
joseph_labeled_plot
Since we created the plot over the course of several assignment operations, it can be useful to summarize what the joseph_labeled_plot
object holds. By typing summary(joseph_labeled_plot)
we can review all of the geometries and aesthetic mappings that we have defined so far (this will also display a big list of nuts-and-bolts properties attached to the “faceting” element, which you can ignore)
Code:
## data: year, sex, name, n, prop, alive_prob, count_thousands,
## age_today, est_alive_today [111x9]
## mapping: x = ~year
## faceting: <ggproto object: Class FacetNull, Facet, gg>
## compute_layout: function
## draw_back: function
## draw_front: function
## draw_labels: function
## draw_panels: function
## finish_data: function
## init_scales: function
## map_data: function
## params: list
## setup_data: function
## setup_params: function
## shrink: TRUE
## train_scales: function
## vars: function
## super: <ggproto object: Class FacetNull, Facet, gg>
## -----------------------------------
## mapping: y = ~est_alive_today/1000
## geom_bar: width = NULL, na.rm = FALSE, orientation = NA, fill = #b2d7e9, colour = white, flipped_aes = FALSE, fill = #b2d7e9, colour = white
## stat_identity: na.rm = FALSE
## position_stack
##
## mapping: y = ~count_thousands
## geom_line: na.rm = FALSE, orientation = NA, flipped_aes = FALSE
## stat_identity: na.rm = FALSE
## position_identity
In the source graphic, the median age of living male Josephs is depicted both in text and with a darker colored bar. We’ll address the annotation elements in a bit; for now let’s try to (a) have R calculate the median age from the data, and (b) change the color of the bar.
If we had raw data on individual Josephs, we could simply calculate the median of an age variable. Since we instead have (estimated) counts at each birth year, we need to take a weighted median, finding the birth year for which half of the total count is on either side. Visually, we are finding the value that divides the “blue ink” in half horizontally.
A convenient function to calculate this is the wtd.quantile()
function, available in the Hmisc
package. Since it’s an older package which is not part of the tidyverse
, its syntax is unfortunately not the same as what we’re used to; in particular, it doesn’t provide a data=
argument. Instead, we can specify the context in which our variable names are embedded using the with()
function from base R, as follows:
Code:
## We do not want to load the Hmisc package in its entirety,
## because some of its functions have the same names as
## functions in the tidyverse, and we don't want to 'mask' the
## latter. Instead we can call the function directly from
## Hmisc without loading the package using the following syntax.
median_birth_year <-
with(
data = Josephs,
expr = Hmisc::wtd.quantile(
x = year,
weights = est_alive_today, # cases in each year
prob = 0.5 # the median is halfway through the cases
))
median_birth_year
## 50%
## 1975
So the median male Joseph in this dataset was born in 1975 (Side note: if you’re reading this from the .Rmd
source, the syntax with the single back quotes with an r
at the front is a way of having RMarkdown insert the value of an R object inline in a paragraph of text. Normally, backquotes format text in a monospace (“code”) font, but the leading r
tells it to evaluate it as an R command.
To add the darker blue bar at 1975, we can add a second geom_bar
to the plot. Since the x axis is mapped globally to the year
variable, we need a trick to have ggplot draw just one bar. Namely, we will define a variable which is equal to est_alive_today
if year == median.birthyear
, and zero otherwise, and map this variable to the y axis. Technically this will draw a whole new set of bars, but all but the one we care about will have height zero.
We can define this variable using the ifelse()
function.
joseph_annotated_plot <-
joseph_labeled_plot +
geom_bar(
stat = "identity",
aes(
y = ifelse(
test = year == median_birth_year, # the condition (note == instead of =)
yes = est_alive_today / 1000, # value if true (height of the dark bar)
no = 0 # value if false (0 height elsewhere)
)),
fill = "#008fd5", # the darker blue color
color = "white")
joseph_annotated_plot
Since this plot came from FiveThirtyEight, let’s see how close we can get to the look of the original by applying the fivethirtyeight
theme from ggthemes
.
library(ggthemes)
joseph_538_plot <- joseph_annotated_plot + theme_fivethirtyeight()
joseph_538_plot
Evidently this theme removes the y axis label, which I’m not crazy about, though it is closer to the original. If we’re going to do without a y axis label, we’d better be sure to supply that context elsewhere.
We can add a title and some text boxes with ggtitle()
and geom_text()
. We can draw the annotation arrow with geom_curve()
and just hardcode the locations of the endpoints.
joseph_final_plot <- joseph_538_plot +
ggtitle("Age Distribution of American Boys Named Joseph") +
geom_text(
x = 1935, y = 40,
label = "Number of Josephs\nborn each year" # the \n is a "newline"
) +
geom_text(
x = 1915, y = 13,
label = "Number of Josephs\nborn each year\nestimated to be alive\non 1/1/2014",
colour = "#b2d7e9"
) +
geom_text(
x = 2003, y = 40,
label = "The median\nliving Joseph\n in 2014 was 37 years old",
colour = "darkgray"
) +
geom_curve(
x = 1995, xend = 1974, y = 40, yend = 24,
arrow = arrow(length = unit(0.3, "cm")),
curvature = 0.5
) +
ylim(0, 42) ## change the range on the y axis to make room for annotation
joseph_final_plot
You can find out more about how to use functions like geom_curve()
and arrow()
using R’s help interface. Remember that practicing statisticians and data scientists usually don’t have all the code they use memorized; even pros have to look things up constantly.
A cool feature about ggplot
s is that they enable us to define the structure of a plot once and “swap in” a new dataset to get the analogous plot. For example, if we are interested in creating an analogous plot for males with the name “Colin” (a fine name, in my opinion), we can use the %+%
operator as follows:
plot_template <- joseph_538_plot
Colins <- filter(BabynamesDist, name == "Colin" & sex == "M")
colin_plot <- plot_template %+% Colins
colin_plot
It seems like Colin is such a youthful name that nearly everyone who has ever been born with that name in the U.S. is likely still alive.
Note that I started with the plot that didn’t have all of the extra context and annotations, since that was all specific to the name Joseph.
Something is wrong here, however. Do you know what it is?
Spoiler
Fortunately, since we defined our highlighting condition using a variable name, all we need to do is update the value of this variable before we generate the plot.
median_birth_year <-
with(
Colins,
Hmisc::wtd.quantile(
x = year,
weights = est_alive_today,
prob = 0.5)
)
## Note that we need to use the same name for median_birthyear that we used before.
## Once we redefine this variable, since the test in our ifelse() clause referred
## to this variable by name within the definition of the plot itself, the plot
## is automatically updated when we change the value of this variable.
colin_plot
Let’s add some annotation. It takes some doing to get the locations of things right, especially since Colin is so much less common than Joseph, so the scale of the plot is quite different.
annotated_colin_plot <- colin_plot +
ggtitle("Age Distribution of American Boys Named Colin") +
geom_text(
x = 1915, y = 5,
label = "Number of Colins\nborn each year" # Use \n for a line break
) +
geom_text(
x = 1915, y = 8,
label = "Number of Colins\nborn each year\nestimated to be alive\non 1/1/2014",
colour = "#b2d7e9"
) +
geom_text(
x = 1968, y = 5,
label = "The median\nliving Colin\n in 2014 was 17 years old",
colour = "darkgray"
) +
geom_curve(
x = 1977, xend = 1997, y = 5, yend = 3,
arrow = arrow(length = unit(0.3, "cm")),
curvature = -0.5
) +
ylim(0, 10) # change the range on the y axis to make room for annotation
If we are interested in comparing, say, the sex breakdown of names that are common for both male and female babies, we can easily add some facets to our plot. For example, the three most “unisex” names throughout this historical period have been “Jessie”, “Marion”, and “Jackie”. But the gender connotation of names is not constant over time, so let’s examine for these names how common each name was for each sex over time.
## The facet_grid format is: row_variable ~ column_variable
many_names_plot <- plot_template +
facet_grid(name ~ sex)
many_names_plot %+%
filter(BabynamesDist, name %in% c("Jessie", "Marion", "Jackie"))
(Note, I haven’t bothered to fix the median highlight here; it’s still showing the median for Colin)
This plot makes it easy to compare across names within a sex for a particular year, since year is lined up within a column, but we are more interested in comparing between sexes for a particular name. So let’s reverse which variable defines the rows and which defines the columns.
many_names_plot <- plot_template + facet_grid(sex ~ name)
many_names_plot %+%
filter(BabynamesDist, name %in% c("Jessie", "Marion", "Jackie"))
#lab5
channel along with a short takeaway..Rmd
and .html
files to the ~/stat209/turnin/lab5/
folder.This lab was adapted by Colin Dawson for STAT 209: Data Computing and Visualization, at Oberlin College, from a similar lab by Jordan Crouser and Ben Baumer used for SDS 192: Introduction to Data Science, at Smith College.