Reshaping data for fun, profit, and “tidyness”

Goal

Become comfortable recognizing when reshaping data will make it better suited to the task at hand, and learn how to do so with the pivot_longer() and pivot_wider() verbs in the tidyr package (part of the all-powerful tidyverse).

Note: The reshaping performed by pivot_longer() and pivot_wider() was previously done by the verbs gather() and spread(), respectively. If you have the 1st edition of the textbook, you may have read about these. They still work, but have been superseded by the two pivot_ functions, which are a bit simpler to use and also more flexible.

The Data

Should we try to squeeze some more insight out of the babynames data? Let’s try to squeeze some more insight out of the babynames data. At least to start with.

Let’s make sure the relevant packages and datasets are loaded.

Code:

Some preparatory wrangling (and review of joins, etc.)

In the last lab, we joined the Social Security babynames data with the Census births data to produce a table that had two records of the total number of births in each year; one from each source.

Here’s the code we used to do it (below is the “full join” version).

Code:

To make sure it worked as expected let’s take a peek at a random sample of the joined data. Remeber that since a few years are only in one dataset or the other, there will be some missing values (NA).

Code:

## # A tibble: 5 x 4
##    year num_rows births.x births.y
##   <dbl>    <int>    <int>    <int>
## 1  1964    12396  3887800  4027490
## 2  1992    25427  3840196  4065014
## 3  1976    17391  3034949  3167788
## 4  1931     9297  2104071  2506000
## 5  1881     1935   192696       NA

The births.x and births.y variables are not very descriptive; also we don’t care so much about the num_rows variable, so let’s do some selection (to remove num_rows) and renameing (to replace the uninformative names with informative ones).

Code:

Let’s look at a random sample from this modified data.

## # A tibble: 5 x 3
##    year     ssa  census
##   <dbl>   <int>   <int>
## 1  1976 3034949 3167788
## 2  1989 3843559 4040958
## 3  2014 3696311 3988076
## 4  1999 3692537 3959417
## 5  1898  381458      NA

Plotting birth counts by source

If we want to visualize the number of births over time from two different sources using two overlaid lines, we have to set the y aesthetic separately for each line.

Also, if we want to use a different color for each source, we have to specify them manually, line by line:

Code:

We also don’t get a legend to tell us which source is which color. We could create this manually, but this is clunky and error-prone.

For a graph like this, we’d like to be able to create an aesthetic mapping between the source of the data and the color of the line. That mapping could then be used to automatically produce a legend. But source isn’t a variable in this data; it’s distinguished between variables, not between cases.

Stacking data with the pivot_longer() function

Thinking about what the legend title and entries would be if we created one gives us a clue about what our dataset is missing: We need a variable called something like source, and a single variable to map to the \(y\)-axis, recording the number of births from the respective source.

We can use pivot_longer() for this, as follows:

Code:

## # A tibble: 5 x 3
##    year source births
##   <dbl> <chr>   <int>
## 1  1880 census     NA
## 2  1880 ssa    201484
## 3  1881 census     NA
## 4  1881 ssa    192696
## 5  1882 census     NA

Having created the source variable and having merged all the counts into a single births variable, we can now create the line graph we want quite easily (and we get a legend automatically, since the color of the line now comes from a variable in the data table)

Code:

The pivot_wider() function

Is the “long” format we’ve created “better” in an absolute sense? Well, it’s better for producing the line graph we wanted, but suppose we wanted to visualize the correlation between the sources with a scatterplot. For a plot like this, we want one axis to be the number of births according to the SSA, and the other axis to be the number of births according to the Census. This was easy in the original data:

Code:

If the data had come to us in the “long” format, however, it would be much less obvious how to create this plot. It’s also not so clear how we’d do something like compute the correlation, or the difference between the estimates in a particular year (I suppose we could use group_by() and summarize() to do this last one, but it wouldn’t be that straightforward).

There may be times when we want to go the other direction: if we want to compute or plot something that depends on ordered pairs (or ordered tuples more generally), such as computing a correlation, or creating a new variable via mutate() that depends on both entries, then it is probably easier if the coordinates of those pairs (or tuples) are stored in separate variables.

The pivot_wider() function does this:

Code:

Ambiguous Matches

For pivot_wider() to work, it needs to know exactly what to put in each new column for each different value of the variable named in names_from. In order for the “wide” format to be well-defined, there must be a unique mapping from each combination of the variables we aren’t reshaping (in this case just year) to a value, for each different key.

If there is more than one row with the same entries in the other variables, pivot_wider() won’t have any way of knowing which rows to merge. It will still run, but you’ll end up with a list of the values from the column named in values_from that share the same entries elsewhere in a single entry of the widened data table.

To illustrate this, let’s create a dataset that has two instances of every row by stacking a second copy of the births_long below the first one.

Now let’s try to “widen” the result.

## Warning: Values are not uniquely identified; output will contain list-cols.
## * Use `values_fn = list` to suppress this warning.
## * Use `values_fn = length` to identify where the duplicates arise
## * Use `values_fn = {summary_fun}` to summarise duplicates
  1. View the births_wide_messy data in RStudio’s data viewer, and verify that each entry in the census and ssa columns is a list of two values rather than a single value. Probably not going to do anything nice for us later if we try to use this.

If, as is the case here, the problem is caused by actual duplicate entries in the data, we can just remove these duplicates before passing the data to pivot_wider(). There are multiple ways to do that, but an easy one is just to pass the data to the distinct() function, to remove completely redundant rows.

  1. Verify that the size of the resulting births_long_deduplicated table is the same as the original births_long data.

Let’s try pivot_wider() again on the deduplicated data.

## # A tibble: 5 x 3
##    year census    ssa
##   <dbl>  <int>  <int>
## 1  1880     NA 201484
## 2  1881     NA 192696
## 3  1882     NA 221533
## 4  1883     NA 216946
## 5  1884     NA 243462

Looks just like the data we started with (except for the order of the columns, and the removal of the duplicate rows)!

Now we can produce a scatterplot… Code:

(which we could do with the original data, but not with the “long” format data; here we’re just undoing what we did, but in real applications we will sometimes have data come to us in “long” format and need to convert it to “wide”)

We can also do things like compute the correlation between the sources (which also would have been difficult in the “long” form):

Code:

## [1] 0.8945154

or compute and plot the amount of undercounting in the SSA data relative to the Census Code:

There are some things we have tried to do before with the original babynames data where widening would have allowed a more concise solution.

For example, recall that the rows in the babynames dataset correspond to unique combinations of year, name, and sex. If we wanted to find the total number of births associated with a particular name in a particular year irrespective of sex, we have can use group_by() and summarize() to get the total by year.

  1. Use filter(), group_by() and summarize() to find the number of births for each name in each year from 1950 to 2000. To reduce the size of the data (and the strain on the server), do the filter() step separately and save the resulting dataset as babynames1950to2000.

SOLUTION
## # A tibble: 820,608 x 3
## # Groups:   year [51]
##     year name    num_births
##    <dbl> <chr>        <int>
##  1  1950 Aaron          805
##  2  1950 Abagail          5
##  3  1950 Abbe            10
##  4  1950 Abbey            7
##  5  1950 Abbie           68
##  6  1950 Abbott           9
##  7  1950 Abby            73
##  8  1950 Abdul            8
##  9  1950 Abe             46
## 10  1950 Abel           170
## # … with 820,598 more rows

This is fine, but suppose we wanted both the total and the number associated with the two recorded sexes in a single table. Previously this involved the somewhat awkward step of using ifelse() with mutate() to create two separate count columns, one for each value of sex, which contain zeroes if the row doesn’t correspond do that sex, then using group_by() and summarize() to get the totals both overall and by sex for each name in each year.

Code:

## # A tibble: 820,608 x 5
## # Groups:   year [51]
##     year name    num_amab num_afab total_births
##    <dbl> <chr>      <dbl>    <dbl>        <int>
##  1  1950 Aaron        798        7          805
##  2  1950 Abagail        0        5            5
##  3  1950 Abbe           0       10           10
##  4  1950 Abbey          0        7            7
##  5  1950 Abbie          0       68           68
##  6  1950 Abbott         9        0            9
##  7  1950 Abby           5       68           73
##  8  1950 Abdul          8        0            8
##  9  1950 Abe           46        0           46
## 10  1950 Abel         170        0          170
## # … with 820,598 more rows
  1. A more elegant solution to achieve this same thing would be to use pivot_wider() to “unstack” the values in the original n column into two columns: one for births tagged "M", and one for births tagged "F". Try to get the same result as above using this approach. Note 1: Remember that pivot_wider() needs the rows it’s merging to match exactly by default. In our case, the prop variable will throw this off, so use id_cols = c(year, name) to tell pivot_wider() to match only on year and name, and not to try to match based on prop (or anything else). This will have the effect of removing the data in prop from the results. Note 2: One of the side effects of using group_by() in the previous solution is that the names are sorted in alphabetical order within each year. To get your results in the same order as above, you may want to do arrange(year, name) on the final output to sort first by name then by year.

SOLUTION
## # A tibble: 10 x 5
##     year name    num_afab num_amab total_births
##    <dbl> <chr>      <int>    <int>        <int>
##  1  1950 Aaron          7      798          805
##  2  1950 Abagail        5        0            5
##  3  1950 Abbe          10        0           10
##  4  1950 Abbey          7        0            7
##  5  1950 Abbie         68        0           68
##  6  1950 Abbott         0        9            9
##  7  1950 Abby          68        5           73
##  8  1950 Abdul          0        8            8
##  9  1950 Abe            0       46           46
## 10  1950 Abel           0      170          170

By the way, using the “widened” data, we can quite easily produce a measure of how asymmetrically a name is used for male and female babies.

  1. Use the wide data to find the name in each year that has the least asymmetry between male and female births, excluding names given to fewer than 100 babies overall in a given year.

SOLUTION
## # A tibble: 54 x 8
## # Groups:   year [51]
##     year name  num_afab num_amab total_births prop_amab prop_afab asymmetry
##    <dbl> <chr>    <int>    <int>        <int>     <dbl>     <dbl>     <dbl>
##  1  1950 Kris       104      105          209     0.502     0.498  0.00478 
##  2  1951 Kris       149      145          294     0.493     0.507  0.0136  
##  3  1952 Pat        300      328          628     0.522     0.478  0.0446  
##  4  1953 Lave…      103      103          206     0.5       0.5    0       
##  5  1953 Tracy      326      326          652     0.5       0.5    0       
##  6  1954 Pat        291      310          601     0.516     0.484  0.0316  
##  7  1955 Dana      1569     1568         3137     0.500     0.500  0.000319
##  8  1956 Shawn      548      567         1115     0.509     0.491  0.0170  
##  9  1957 Kelly     1906     1868         3774     0.495     0.505  0.0101  
## 10  1958 Cour…       71       78          149     0.523     0.477  0.0470  
## # … with 44 more rows

Exercises

  1. Find an interesting dataset from the Gapminder repository here and download the .csv.

  2. Upload the .csv file to RStudio (assuming you are working on the server), and read it in using read_csv() (the function with an underscore is part of the tidyverse, and tends to work better than the built-in one with a period). You will need to supply as the argument to read_csv() the path to the file relative to the directory where your .Rmd is.


CODE

  1. The cases in the Gapminder datasets are countries. Use rename() to change the first variable name to country (since rename() expects a variable name without quotes, you may need to surround the original variable name with backticks (the same syntax used to get a code font in Markdown) if it has spaces or special characters. Pro-tip: never use spaces or special characters (other than underscores) in variable names.

Example:


CODE

  1. We will convert our data to a format with exactly three columns: country, year, and value (whatever value is for your chosen dataset). Before you write any code, sketch on paper what the “tidified” data will look like. Be sure to indicate how many rows it will have.

  2. Use pivot_longer() to convert your data into this format.


SOLUTION

  1. The year variable may be stored as text instead of as a number, which will make mapping it to a positional visual cue challenging. Fix this using mutate(), with the help of the parse_date() function (supplied by the readr package, which is also part of the tidyverse). Type ?parse_date at the console to see how to use it if it’s not clear.

CODE

  1. Plot your variable as a time series line graph, mapping country to color. Post your graph from Exercise 8 to the #lab8 channel.

CODE