Become comfortable recognizing when reshaping data will make it better suited to the task at hand, and learn how to do so with the pivot_longer()
and pivot_wider()
verbs in the tidyr
package (part of the all-powerful tidyverse
).
Note: The reshaping performed by pivot_longer()
and pivot_wider()
was previously done by the verbs gather()
and spread()
, respectively. If you have the 1st edition of the textbook, you may have read about these. They still work, but have been superseded by the two pivot_
functions, which are a bit simpler to use and also more flexible.
Should we try to squeeze some more insight out of the babynames
data? Let’s try to squeeze some more insight out of the babynames
data. At least to start with.
Let’s make sure the relevant packages and datasets are loaded.
Code:
In the last lab, we joined the Social Security babynames
data with the Census births
data to produce a table that had two records of the total number of births in each year; one from each source.
Here’s the code we used to do it (below is the “full join” version).
Code:
census_births <- births
ssa_births <- babynames %>%
rename(num_births = n) %>%
group_by(year) %>%
summarize(
num_rows = n(),
births = sum(num_births))
total_births <-
ssa_births %>%
full_join(census_births, by = "year")
To make sure it worked as expected let’s take a peek at a random sample of the joined data. Remeber that since a few years are only in one dataset or the other, there will be some missing values (NA
).
Code:
## # A tibble: 5 x 4
## year num_rows births.x births.y
## <dbl> <int> <int> <int>
## 1 1964 12396 3887800 4027490
## 2 1992 25427 3840196 4065014
## 3 1976 17391 3034949 3167788
## 4 1931 9297 2104071 2506000
## 5 1881 1935 192696 NA
The births.x
and births.y
variables are not very descriptive; also we don’t care so much about the num_rows
variable, so let’s do some select
ion (to remove num_rows
) and rename
ing (to replace the uninformative names with informative ones).
Code:
Let’s look at a random sample from this modified data.
## # A tibble: 5 x 3
## year ssa census
## <dbl> <int> <int>
## 1 1976 3034949 3167788
## 2 1989 3843559 4040958
## 3 2014 3696311 3988076
## 4 1999 3692537 3959417
## 5 1898 381458 NA
If we want to visualize the number of births over time from two different sources using two overlaid lines, we have to set the y
aesthetic separately for each line.
Also, if we want to use a different color for each source, we have to specify them manually, line by line:
Code:
total_births %>%
ggplot(aes(x = year)) +
geom_line(aes(y = census), color = "blue", na.rm = TRUE) +
geom_line(aes(y = ssa), color = "orange", na.rm = TRUE) +
scale_x_continuous(
name = "Year",
breaks = seq(1880,2020,by=10)) +
scale_y_continuous(
name = "Total Births (Millions)",
breaks = seq(0,5000000,1000000),
labels = 0:5)
We also don’t get a legend to tell us which source is which color. We could create this manually, but this is clunky and error-prone.
For a graph like this, we’d like to be able to create an aesthetic mapping between the source
of the data and the color
of the line. That mapping could then be used to automatically produce a legend. But source
isn’t a variable in this data; it’s distinguished between variables, not between cases.
pivot_longer()
functionThinking about what the legend title and entries would be if we created one gives us a clue about what our dataset is missing: We need a variable called something like source
, and a single variable to map to the \(y\)-axis, recording the number of births from the respective source.
We can use pivot_longer()
for this, as follows:
Code:
births_long <- total_births %>%
pivot_longer(
cols = c("census", "ssa"), # these are the columns we're "merging"
names_to = "source", # this is what we're going to call the new variable
values_to = "births", # this is where the stacked values will go
)
## # A tibble: 5 x 3
## year source births
## <dbl> <chr> <int>
## 1 1880 census NA
## 2 1880 ssa 201484
## 3 1881 census NA
## 4 1881 ssa 192696
## 5 1882 census NA
Having created the source
variable and having merged all the counts into a single births
variable, we can now create the line graph we want quite easily (and we get a legend automatically, since the color of the line now comes from a variable in the data table)
Code:
pivot_wider()
functionIs the “long” format we’ve created “better” in an absolute sense? Well, it’s better for producing the line graph we wanted, but suppose we wanted to visualize the correlation between the sources with a scatterplot. For a plot like this, we want one axis to be the number of births according to the SSA, and the other axis to be the number of births according to the Census. This was easy in the original data:
Code:
total_births %>%
ggplot(aes(x = ssa, y = census)) +
geom_point(na.rm = TRUE) +
scale_x_continuous(
name = "Births Recorded by the SSA (Millions)",
limits = c(0,5000000),
breaks = seq(0,5000000, by = 1000000),
labels = 0:5) +
scale_y_continuous(
name = "Births Recorded by the Census (Millions)",
limits = c(2000000,5000000),
breaks = seq(2000000,5000000, by = 1000000),
labels = 2:5)
If the data had come to us in the “long” format, however, it would be much less obvious how to create this plot. It’s also not so clear how we’d do something like compute the correlation, or the difference between the estimates in a particular year (I suppose we could use group_by()
and summarize()
to do this last one, but it wouldn’t be that straightforward).
There may be times when we want to go the other direction: if we want to compute or plot something that depends on ordered pairs (or ordered tuples more generally), such as computing a correlation, or creating a new variable via mutate()
that depends on both entries, then it is probably easier if the coordinates of those pairs (or tuples) are stored in separate variables.
The pivot_wider()
function does this:
Code:
births_wide <- births_long %>%
pivot_wider(
names_from = source, # Unlike with pivot_longer(), this should name an existing variable
values_from = births # Unlike with pivot_longer(), this should name an existing variable
)
For pivot_wider()
to work, it needs to know exactly what to put in each new column for each different value of the variable named in names_from
. In order for the “wide” format to be well-defined, there must be a unique mapping from each combination of the variables we aren’t reshaping (in this case just year
) to a value, for each different key.
If there is more than one row with the same entries in the other variables, pivot_wider()
won’t have any way of knowing which rows to merge. It will still run, but you’ll end up with a list of the values from the column named in values_from
that share the same entries elsewhere in a single entry of the widened data table.
To illustrate this, let’s create a dataset that has two instances of every row by stacking a second copy of the births_long
below the first one.
Now let’s try to “widen” the result.
births_wide_messy <- births_long_duplicated %>%
pivot_wider(
names_from = "source",
values_from = "births"
)
## Warning: Values are not uniquely identified; output will contain list-cols.
## * Use `values_fn = list` to suppress this warning.
## * Use `values_fn = length` to identify where the duplicates arise
## * Use `values_fn = {summary_fun}` to summarise duplicates
births_wide_messy
data in RStudio’s data viewer, and verify that each entry in the census
and ssa
columns is a list of two values rather than a single value. Probably not going to do anything nice for us later if we try to use this.If, as is the case here, the problem is caused by actual duplicate entries in the data, we can just remove these duplicates before passing the data to pivot_wider()
. There are multiple ways to do that, but an easy one is just to pass the data to the distinct()
function, to remove completely redundant rows.
births_long_deduplicated
table is the same as the original births_long
data.Let’s try pivot_wider()
again on the deduplicated data.
## # A tibble: 5 x 3
## year census ssa
## <dbl> <int> <int>
## 1 1880 NA 201484
## 2 1881 NA 192696
## 3 1882 NA 221533
## 4 1883 NA 216946
## 5 1884 NA 243462
Looks just like the data we started with (except for the order of the columns, and the removal of the duplicate rows)!
Now we can produce a scatterplot… Code:
births_wide %>%
ggplot(aes(x = ssa, y = census)) +
geom_point(na.rm = TRUE) +
scale_x_continuous(
name = "Births Recorded by the SSA (Millions)",
limits = c(0,5000000),
breaks = seq(0,5000000, by = 1000000),
labels = 0:5) +
scale_y_continuous(
name = "Births Recorded by the Census (Millions)",
limits = c(2000000,5000000),
breaks = seq(2000000,5000000, by = 1000000),
labels = 2:5)
(which we could do with the original data, but not with the “long” format data; here we’re just undoing what we did, but in real applications we will sometimes have data come to us in “long” format and need to convert it to “wide”)
We can also do things like compute the correlation between the sources (which also would have been difficult in the “long” form):
Code:
## The cor() function from the mosaic package has a nicer interface than the default cor()
## function, because it has a data= argument.
## The use= argument tells cor() how to handle missing data.
mosaic::cor(census ~ ssa, data = births_wide, use = "pairwise.complete")
## [1] 0.8945154
or compute and plot the amount of undercounting in the SSA data relative to the Census Code:
births_wide %>%
mutate(percent_in_ssa = ssa / census * 100) %>%
ggplot(aes(x = year, y = percent_in_ssa)) +
geom_line(na.rm = TRUE) +
scale_x_continuous(name = "Year") +
scale_y_continuous(name = "% of census births recorded by SSA")
There are some things we have tried to do before with the original babynames
data where widening would have allowed a more concise solution.
For example, recall that the rows in the babynames
dataset correspond to unique combinations of year, name, and sex. If we wanted to find the total number of births associated with a particular name in a particular year irrespective of sex, we have can use group_by()
and summarize()
to get the total by year.
filter()
, group_by()
and summarize()
to find the number of births for each name in each year from 1950 to 2000. To reduce the size of the data (and the strain on the server), do the filter()
step separately and save the resulting dataset as babynames1950to2000
.## # A tibble: 820,608 x 3
## # Groups: year [51]
## year name num_births
## <dbl> <chr> <int>
## 1 1950 Aaron 805
## 2 1950 Abagail 5
## 3 1950 Abbe 10
## 4 1950 Abbey 7
## 5 1950 Abbie 68
## 6 1950 Abbott 9
## 7 1950 Abby 73
## 8 1950 Abdul 8
## 9 1950 Abe 46
## 10 1950 Abel 170
## # … with 820,598 more rows
This is fine, but suppose we wanted both the total and the number associated with the two recorded sexes in a single table. Previously this involved the somewhat awkward step of using ifelse()
with mutate()
to create two separate count columns, one for each value of sex
, which contain zeroes if the row doesn’t correspond do that sex, then using group_by()
and summarize()
to get the totals both overall and by sex for each name in each year.
Code:
babynames1950to2000 %>%
rename(num_births = n) %>%
mutate(
num_if_amab = ifelse(sex == "M", num_births, 0),
num_if_afab = ifelse(sex == "F", num_births, 0)) %>%
group_by(year, name) %>%
summarize(
num_amab = sum(num_if_amab),
num_afab = sum(num_if_afab),
total_births = sum(num_births))
## # A tibble: 820,608 x 5
## # Groups: year [51]
## year name num_amab num_afab total_births
## <dbl> <chr> <dbl> <dbl> <int>
## 1 1950 Aaron 798 7 805
## 2 1950 Abagail 0 5 5
## 3 1950 Abbe 0 10 10
## 4 1950 Abbey 0 7 7
## 5 1950 Abbie 0 68 68
## 6 1950 Abbott 9 0 9
## 7 1950 Abby 5 68 73
## 8 1950 Abdul 8 0 8
## 9 1950 Abe 46 0 46
## 10 1950 Abel 170 0 170
## # … with 820,598 more rows
pivot_wider()
to “unstack” the values in the original n
column into two columns: one for births tagged "M"
, and one for births tagged "F"
. Try to get the same result as above using this approach. Note 1: Remember that pivot_wider()
needs the rows it’s merging to match exactly by default. In our case, the prop
variable will throw this off, so use id_cols = c(year, name)
to tell pivot_wider()
to match only on year
and name
, and not to try to match based on prop
(or anything else). This will have the effect of removing the data in prop
from the results. Note 2: One of the side effects of using group_by()
in the previous solution is that the names are sorted in alphabetical order within each year. To get your results in the same order as above, you may want to do arrange(year, name)
on the final output to sort first by name then by year.babynames_1950to2000_wide <-
babynames1950to2000 %>%
rename(num_births = n) %>%
pivot_wider(
id_cols = c(year, name),
names_from = sex,
values_from = num_births,
values_fill = 0 # what value to use if no row exists for a particular key combination
# (if omitted will produce NA)
) %>%
rename(
num_afab = "F",
num_amab = "M"
) %>%
mutate(
total_births = num_amab + num_afab
) %>%
arrange(year, name)
## # A tibble: 10 x 5
## year name num_afab num_amab total_births
## <dbl> <chr> <int> <int> <int>
## 1 1950 Aaron 7 798 805
## 2 1950 Abagail 5 0 5
## 3 1950 Abbe 10 0 10
## 4 1950 Abbey 7 0 7
## 5 1950 Abbie 68 0 68
## 6 1950 Abbott 0 9 9
## 7 1950 Abby 68 5 73
## 8 1950 Abdul 0 8 8
## 9 1950 Abe 0 46 46
## 10 1950 Abel 0 170 170
By the way, using the “widened” data, we can quite easily produce a measure of how asymmetrically a name is used for male and female babies.
## Let's find the names with the least asymmetry, aggregating over time:
babynames_1950to2000_wide %>%
filter(total_births >= 100) %>% # exclude very uncommon names in a given year
mutate(
prop_amab = num_amab / total_births,
prop_afab = num_afab / total_births,
asymmetry = abs(prop_amab - prop_afab)
) %>%
group_by(year) %>%
slice_min(order_by = asymmetry)
## # A tibble: 54 x 8
## # Groups: year [51]
## year name num_afab num_amab total_births prop_amab prop_afab asymmetry
## <dbl> <chr> <int> <int> <int> <dbl> <dbl> <dbl>
## 1 1950 Kris 104 105 209 0.502 0.498 0.00478
## 2 1951 Kris 149 145 294 0.493 0.507 0.0136
## 3 1952 Pat 300 328 628 0.522 0.478 0.0446
## 4 1953 Lave… 103 103 206 0.5 0.5 0
## 5 1953 Tracy 326 326 652 0.5 0.5 0
## 6 1954 Pat 291 310 601 0.516 0.484 0.0316
## 7 1955 Dana 1569 1568 3137 0.500 0.500 0.000319
## 8 1956 Shawn 548 567 1115 0.509 0.491 0.0170
## 9 1957 Kelly 1906 1868 3774 0.495 0.505 0.0101
## 10 1958 Cour… 71 78 149 0.523 0.477 0.0470
## # … with 44 more rows
Find an interesting dataset from the Gapminder repository here and download the .csv
.
Upload the .csv
file to RStudio (assuming you are working on the server), and read it in using read_csv()
(the function with an underscore is part of the tidyverse, and tends to work better than the built-in one with a period). You will need to supply as the argument to read_csv()
the path to the file relative to the directory where your .Rmd
is.
rename()
to change the first variable name to country
(since rename()
expects a variable name without quotes, you may need to surround the original variable name with backticks (the same syntax used to get a code font in Markdown) if it has spaces or special characters. Pro-tip: never use spaces or special characters (other than underscores) in variable names.Example:
We will convert our data to a format with exactly three columns: country
, year
, and value
(whatever value
is for your chosen dataset). Before you write any code, sketch on paper what the “tidified” data will look like. Be sure to indicate how many rows it will have.
Use pivot_longer()
to convert your data into this format.
year
variable may be stored as text instead of as a number, which will make mapping it to a positional visual cue challenging. Fix this using mutate()
, with the help of the parse_date()
function (supplied by the readr
package, which is also part of the tidyverse). Type ?parse_date
at the console to see how to use it if it’s not clear.country
to color
. Post your graph from Exercise 8 to the #lab8 channel.