Reshaping data for fun, profit, and “tidyness”

Goal

Become comfortable recognizing when reshaping data will make it better suited to the task at hand, and learn how to do so with the pivot_longer() and pivot_wider() verbs in the tidyr package (part of the all-powerful tidyverse).

Note: The reshaping performed by pivot_longer() and pivot_wider() was previously done by the verbs gather() and spread(), respectively. If you have the 1st edition of the textbook, you may have read about these. They still work, but have been superseded by the two pivot_ functions, which are a bit simpler to use and also more flexible.

The Data

Should we try to squeeze some more insight out of the babynames data? Let’s try to squeeze some more insight out of the babynames data. At least to start with.

Let’s make sure the relevant packages and datasets are loaded.

Code:

library(tidyverse)
library(babynames)
data(babynames)    ## SSA data
data(births)       ## Census data

Some preparatory wrangling (and review of joins, etc.)

In the last lab, we joined the Social Security babynames data with the Census births data to produce a table that had two records of the total number of births in each year; one from each source.

Here’s the code we used to do it (below is the “full join” version).

Code:

census_births <- births
ssa_births <- babynames %>%
  rename(num_births = n) %>%
  group_by(year) %>%
  summarize(
    num_rows = n(), 
    births   = sum(num_births))
total_births <-
  ssa_births %>%
  full_join(census_births, by = "year")

To make sure it worked as expected let’s take a peek at a random sample of the joined data. Remeber that since a few years are only in one dataset or the other, there will be some missing values (NA).

Code:

total_births %>%
  sample_n(size = 5)

## # A tibble: 5 x 4
##    year num_rows births.x births.y
##   <dbl>    <int>    <int>    <int>
## 1  1916     9696  1934434  2964000
## 2  1935     9037  2089596  2377000
## 3  1972    15411  3143627  3258411
## 4  1926    10458  2295937  2839000
## 5  1901     3153   345814       NA

The births.x and births.y variables are not very descriptive; also we don’t care so much about the num_rows variable, so let’s do some selection (to remove num_rows) and renameing (to replace the uninformative names with informative ones).

Code:

total_births <- total_births %>%
  select(-num_rows) %>%
  rename(
    ssa    = births.x, 
    census = births.y)

Let’s look at a random sample from this modified data.

total_births %>% sample_n(size = 5)

## # A tibble: 5 x 3
##    year     ssa  census
##   <dbl>   <int>   <int>
## 1  1888  299473      NA
## 2  1987 3604403 3809394
## 3  1993 3769282 4000240
## 4  1943 2822127 3104000
## 5  1934 2076507 2396000

Plotting birth counts by source

If we want to visualize the number of births over time from two different sources using two overlaid lines, we have to set the y aesthetic separately for each line.

Also, if we want to use a different color for each source, we have to specify them manually, line by line:

Code:

total_births %>%
  ggplot(aes(x = year)) +
  geom_line(aes(y = census), color = "blue", na.rm = TRUE) +
  geom_line(aes(y = ssa), color = "orange", na.rm = TRUE) +
  scale_x_continuous(
    name   = "Year", 
    breaks = seq(1880,2020,by=10)) + 
  scale_y_continuous(
    name   = "Total Births (Millions)", 
    breaks = seq(0,5000000,1000000),
    labels = 0:5)

We also don’t get a legend to tell us which source is which color. We could create this manually, but this is clunky and error-prone.

For a graph like this, we’d like to be able to create an aesthetic mapping between the source of the data and the color of the line. That mapping could then be used to automatically produce a legend. But source isn’t a variable in this data; it’s distinguished between variables, not between cases.

Stacking data with the `pivot_longer()` function

Thinking about what the legend title and entries would be if we created one gives us a clue about what our dataset is missing: We need a variable called something like source, and a single variable to map to the \(y\)-axis, recording the number of births from the respective source.

We can use pivot_longer() for this, as follows:

Code:

births_long <- total_births %>%
  pivot_longer(
    cols      = c("census", "ssa"), # these are the columns we're "merging"
    names_to  = "source",   # this is what we're going to call the new variable
    values_to = "births",   # this is where the stacked values will go
    )

## The earliest years are not in the census dataset
births_long %>% slice_head(n = 5)

## # A tibble: 5 x 3
##    year source births
##   <dbl> <chr>   <int>
## 1  1880 census     NA
## 2  1880 ssa    201484
## 3  1881 census     NA
## 4  1881 ssa    192696
## 5  1882 census     NA

Having created the source variable and having merged all the counts into a single births variable, we can now create the line graph we want quite easily (and we get a legend automatically, since the color of the line now comes from a variable in the data table)

Code:

births_long %>%
  ggplot(aes(x = year, y = births, color = source)) +
  geom_line(na.rm = TRUE) +
  scale_x_continuous(
    name   = "Year",
    breaks = seq(1880,2020,by=10)) +
  scale_y_continuous(
    name   = "Total Births (Millions)", 
    breaks = seq(0,5000000,1000000),
    labels = 0:5)

The `pivot_wider()` function

Is the “long” format we’ve created “better” in an absolute sense? Well, it’s better for producing the line graph we wanted, but suppose we wanted to visualize the correlation between the sources with a scatterplot. For a plot like this, we want one axis to be the number of births according to the SSA, and the other axis to be the number of births according to the Census. This was easy in the original data:

Code:

total_births %>%
  ggplot(aes(x = ssa, y = census)) +
  geom_point(na.rm = TRUE) +
  scale_x_continuous(
    name   = "Births Recorded by the SSA (Millions)",
    limits = c(0,5000000),
    breaks = seq(0,5000000, by = 1000000),
    labels = 0:5) +
  scale_y_continuous(
    name   = "Births Recorded by the Census (Millions)",
    limits = c(2000000,5000000),
    breaks = seq(2000000,5000000, by = 1000000),
    labels = 2:5)

If the data had come to us in the “long” format, however, it would be much less obvious how to create this plot. It’s also not so clear how we’d do something like compute the correlation, or the difference between the estimates in a particular year (I suppose we could use group_by() and summarize() to do this last one, but it wouldn’t be that straightforward).

There may be times when we want to go the other direction: if we want to compute or plot something that depends on ordered pairs (or ordered tuples more generally), such as computing a correlation, or creating a new variable via mutate() that depends on both entries, then it is probably easier if the coordinates of those pairs (or tuples) are stored in separate variables.

The pivot_wider() function does this:

Code:

births_wide <- births_long %>%
  pivot_wider(
    names_from  = source, # Unlike with pivot_longer(), this should name an existing variable
    values_from = births  # Unlike with pivot_longer(), this should name an existing variable
    )

Ambiguous Matches

For pivot_wider() to work, it needs to know exactly what to put in each new column for each different value of the variable named in names_from. In order for the “wide” format to be well-defined, there must be a unique mapping from each combination of the variables we aren’t reshaping (in this case just year) to a value, for each different key.

If there is more than one row with the same entries in the other variables, pivot_wider() won’t have any way of knowing which rows to merge. It will still run, but you’ll end up with a list of the values from the column named in values_from that share the same entries elsewhere in a single entry of the widened data table.

To illustrate this, let’s create a dataset that has two instances of every row by stacking a second copy of the births_long below the first one.

births_long_duplicated <- births_long %>% bind_rows(births_long)

Now let’s try to “widen” the result.

births_wide_messy <- births_long_duplicated %>%
  pivot_wider(
    names_from  = "source",
    values_from = "births"
  )

## Warning: Values are not uniquely identified; output will contain list-cols.
## * Use `values_fn = list` to suppress this warning.
## * Use `values_fn = length` to identify where the duplicates arise
## * Use `values_fn = {summary_fun}` to summarise duplicates

View the births_wide_messy data in RStudio’s data viewer, and verify that each entry in the census and ssa columns is a list of two values rather than a single value. Probably not going to do anything nice for us later if we try to use this.

If, as is the case here, the problem is caused by actual duplicate entries in the data, we can just remove these duplicates before passing the data to pivot_wider(). There are multiple ways to do that, but an easy one is just to pass the data to the distinct() function, to remove completely redundant rows.

births_long_deduplicated <- births_long_duplicated %>%
  distinct()

Verify that the size of the resulting births_long_deduplicated table is the same as the original births_long data.

Let’s try pivot_wider() again on the deduplicated data.

births_wide <- births_long_deduplicated %>%
  pivot_wider(
    names_from  = source, 
    values_from = births)

births_wide %>% slice_head(n = 5)

## # A tibble: 5 x 3
##    year census    ssa
##   <dbl>  <int>  <int>
## 1  1880     NA 201484
## 2  1881     NA 192696
## 3  1882     NA 221533
## 4  1883     NA 216946
## 5  1884     NA 243462

Looks just like the data we started with (except for the order of the columns, and the removal of the duplicate rows)!

Now we can produce a scatterplot…

Code:

births_wide %>% 
  ggplot(aes(x = ssa, y = census)) +
  geom_point(na.rm = TRUE) +
  scale_x_continuous(
    name   = "Births Recorded by the SSA (Millions)",
    limits = c(0,5000000),
    breaks = seq(0,5000000, by = 1000000),
    labels = 0:5) +
  scale_y_continuous(
    name   = "Births Recorded by the Census (Millions)",
    limits = c(2000000,5000000),
    breaks = seq(2000000,5000000, by = 1000000),
    labels = 2:5)

(which we could do with the original data, but not with the “long” format data; here we’re just undoing what we did, but in real applications we will sometimes have data come to us in “long” format and need to convert it to “wide”)

We can also do things like compute the correlation between the sources (which also would have been difficult in the “long” form):

Code:

## The cor() function from the mosaic package has a nicer interface than the default cor()
## function, because it has a data= argument.
## The use= argument tells cor() how to handle missing data.
mosaic::cor(census ~ ssa, data = births_wide, use = "pairwise.complete")

## [1] 0.8945154

or compute and plot the amount of undercounting in the SSA data relative to the Census Code:

births_wide %>%
  mutate(percent_in_ssa = ssa / census * 100) %>%
  ggplot(aes(x = year, y = percent_in_ssa)) +
    geom_line(na.rm = TRUE) + 
    scale_x_continuous(name = "Year") +
    scale_y_continuous(name = "% of census births recorded by SSA")

There are some things we have tried to do before with the original babynames data where widening would have allowed a more concise solution.

For example, recall that the rows in the babynames dataset correspond to unique combinations of year, name, and sex. If we wanted to find the total number of births associated with a particular name in a particular year irrespective of sex, we have can use group_by() and summarize() to get the total by year.

Use filter(), group_by() and summarize() to find the number of births for each name in each year from 1950 to 2000. To reduce the size of the data (and the strain on the server), do the filter() step separately and save the resulting dataset as babynames1950to2000.

SOLUTION

babynames1950to2000 <- 
  babynames %>% filter(year %>% between(1950,2000))

### Additional code here...

This is fine, but suppose we wanted both the total and the number associated with the two recorded sexes in a single table. Previously this involved the somewhat awkward step of using ifelse() with mutate() to create two separate count columns, one for each value of sex, which contain zeroes if the row doesn’t correspond do that sex, then using group_by() and summarize() to get the totals both overall and by sex for each name in each year.

Code:

babynames1950to2000 %>%
  rename(num_births = n) %>%
  mutate(
    num_if_amab = ifelse(sex == "M", num_births, 0),
    num_if_afab = ifelse(sex == "F", num_births, 0)) %>%
  group_by(year, name) %>%
  summarize(
    num_amab     = sum(num_if_amab),
    num_afab     = sum(num_if_afab),
    total_births = sum(num_births))

## # A tibble: 820,608 x 5
## # Groups:   year [51]
##     year name    num_amab num_afab total_births
##    <dbl> <chr>      <dbl>    <dbl>        <int>
##  1  1950 Aaron        798        7          805
##  2  1950 Abagail        0        5            5
##  3  1950 Abbe           0       10           10
##  4  1950 Abbey          0        7            7
##  5  1950 Abbie          0       68           68
##  6  1950 Abbott         9        0            9
##  7  1950 Abby           5       68           73
##  8  1950 Abdul          8        0            8
##  9  1950 Abe           46        0           46
## 10  1950 Abel         170        0          170
## # … with 820,598 more rows

A more elegant solution to achieve this same thing would be to use pivot_wider() to “unstack” the values in the original n column into two columns: one for births tagged "M", and one for births tagged "F". Try to get the same result as above using this approach. Note 1: Remember that pivot_wider() needs the rows it’s merging to match exactly by default. In our case, the prop variable will throw this off, so use id_cols = c(year, name) to tell pivot_wider() to match only on year and name, and not to try to match based on prop (or anything else). This will have the effect of removing the data in prop from the results. Note 2: One of the side effects of using group_by() in the previous solution is that the names are sorted in alphabetical order within each year. To get your results in the same order as above, you may want to do arrange(year, name) on the final output to sort first by name then by year.

SOLUTION

By the way, using the “widened” data, we can quite easily produce a measure of how asymmetrically a name is used for male and female babies.

Use the wide data to find the name in each year that has the least asymmetry between male and female births, excluding names given to fewer than 100 babies overall in a given year.

SOLUTION

Exercises

Find an interesting dataset from the Gapminder repository here and download the .csv.
Upload the .csv file to RStudio (assuming you are working on the server), and read it in using read_csv() (the function with an underscore is part of the tidyverse, and tends to work better than the built-in one with a period). You will need to supply as the argument to read_csv() the path to the file relative to the directory where your .Rmd is.

CODE

The cases in the Gapminder datasets are countries. Use rename() to change the first variable name to country (since rename() expects a variable name without quotes, you may need to surround the original variable name with backticks (the same syntax used to get a code font in Markdown) if it has spaces or special characters. Pro-tip: never use spaces or special characters (other than underscores) in variable names.

Example:

mydata <- mydata %>%
  rename(country = `Bad Variable Name`)

CODE

We will convert our data to a format with exactly three columns: country, year, and value (whatever value is for your chosen dataset). Before you write any code, sketch on paper what the “tidified” data will look like. Be sure to indicate how many rows it will have.
Use pivot_longer() to convert your data into this format.

SOLUTION

The year variable may be stored as text instead of as a number, which will make mapping it to a positional visual cue challenging. Fix this using mutate(), with the help of the parse_date() function (supplied by the readr package, which is also part of the tidyverse). Type ?parse_date at the console to see how to use it if it’s not clear.

CODE

Plot your variable as a time series line graph, mapping country to color. Post your graph from Exercise 8 to the #lab8 channel.

STAT 209: Lab 9

Colin Reimer Dawson

July 8, 2021

Reshaping data for fun, profit, and “tidyness”

Goal

The Data

Some preparatory wrangling (and review of joins, etc.)

Plotting birth counts by source

Stacking data with the `pivot_longer()` function

The `pivot_wider()` function

Ambiguous Matches

SOLUTION

SOLUTION

SOLUTION

Exercises

CODE

CODE

SOLUTION

CODE

CODE

STAT 209: Lab 9

Colin Reimer Dawson

July 8, 2021

Reshaping data for fun, profit, and “tidyness”

Goal

The Data

Some preparatory wrangling (and review of joins, etc.)

Plotting birth counts by source

Stacking data with the pivot_longer() function

The pivot_wider() function

Ambiguous Matches

SOLUTION

SOLUTION

SOLUTION

Exercises

CODE

CODE

SOLUTION

CODE

CODE

Stacking data with the `pivot_longer()` function

The `pivot_wider()` function