Reshaping data to make it “tidy”

Goal

Become comfortable recognizing when reshaping data will make it better suited to the task at hand, and learn how to do so with the gather() and spread() verbs in the tidyr package (part of the all-powerful tidyverse).

The Data

Should we try to squeeze some more insight out of the babynames data? Let’s try to squeeze some more insight out of the babynames data. At least for starters.

**Load the packages and data:**

library(tidyverse)
library(babynames)
data(babynames)    ## SSA data
data(births)       ## Census data

In the last lab, we joined the Social Security babyname data with the Census births data to produce a table that had two records of the total number of births in each year; one from each source.

Here’s the code we used to do it (below is the “full join” version).

Code:

census_births <- births
ssa_births <- babynames %>%
  rename(num_births = n) %>%
  group_by(year) %>%
  summarize(num_rows = n(), births = sum(num_births))
total_births <-
  ssa_births %>%
  full_join(census_births, by = "year")
head(total_births)
## # A tibble: 6 x 4
##    year num_rows births.x births.y
##   <dbl>    <int>    <int>    <int>
## 1  1880     2000   201482       NA
## 2  1881     1935   192696       NA
## 3  1882     2127   221534       NA
## 4  1883     2084   216945       NA
## 5  1884     2297   243463       NA
## 6  1885     2294   240854       NA

The births.x and births.y variables are not very descriptive; also we don’t care so much about the num_rows variable, so let’s do some selection (to remove num_rows) and renameing (to replace the uninformative names with informative ones).

Code:

total_births <- total_births %>%
  select(-num_rows) %>%
  rename(ssa = births.x, census = births.y)
head(total_births)
## # A tibble: 6 x 3
##    year    ssa census
##   <dbl>  <int>  <int>
## 1  1880 201482     NA
## 2  1881 192696     NA
## 3  1882 221534     NA
## 4  1883 216945     NA
## 5  1884 243463     NA
## 6  1885 240854     NA

If we want to visualize the number of births over time with two overlaid lines, we have to set the y aesthetic separately for each line, and if we want different colors, we have to specify them manually, line by line:

Code:

total_births %>%
  ggplot(aes(x = year)) +
  geom_line(aes(y = census), color = "blue") +
  geom_line(aes(y = ssa), color = "orange") +
  xlab("Year") + ylab("Total Births")
## Warning: Removed 30 rows containing missing values (geom_path).

We also don’t get an automatic legend.

For a graph like this, we’d like to be able to create an aesthetic mapping between the source of the data and the color of the line. That mapping could then be used to automatically produce a legend. But source isn’t a variable in this data; it’s distinguished between variables, not between cases.

The gather() function

Thinking about what the legend title would be if we created one gives us a clue that we need to wrangle this data into a format in which source is a variable, and the quantity we are plotting on the y-axis is stored in a single variable that we can map globally to the y aesthetic.

We can use gather() for this, as follows:

Code:

births_long <- total_births %>%
  gather(
    key = "source",   # this is what we're going to call the new variable
    value = "births", # this is where the values will go in "long" format
    census, ssa       # these are the columns we're "merging"
                      # (we could have also said -year, to merge all
                      # columns *except* year
    )
head(births_long)
## # A tibble: 6 x 3
##    year source births
##   <dbl>  <chr>  <int>
## 1  1880 census     NA
## 2  1881 census     NA
## 3  1882 census     NA
## 4  1883 census     NA
## 5  1884 census     NA
## 6  1885 census     NA

Having created the source variable and having merged all the counts into a single births variable, we can now create the line graph we want quite easily (and we get a legend automatically, since the color of the line is now a bona fide part of the aesthetic mapping)

Code:

births_long %>%
  ggplot(aes(x = year, y = births, color = source)) +
  geom_line()
## Warning: Removed 30 rows containing missing values (geom_path).

The spread() function

Is the “long” format we’ve created “better” in an absolute sense? Well, it’s better for producing the line graph we wanted, but suppose we wanted to visualize the correlation between the sources with a scatterplot. For a plot like this, we want one axis to be the number of births according to the SSA, and the other axis to be the number of births according to the Census. This was easy in the original data:

Code:

total_births %>%
  ggplot(aes(x = ssa, y = census)) +
  geom_point()
## Warning: Removed 30 rows containing missing values (geom_point).