Become comfortable recognizing when reshaping data will make it better suited to the task at hand, and learn how to do so with the gather()
and spread()
verbs in the tidyr
package (part of the all-powerful tidyverse
).
Should we try to squeeze some more insight out of the babynames
data? Let’s try to squeeze some more insight out of the babynames
data. At least for starters.
Load the packages and data:
In the last lab, we joined the Social Security babynames
data with the Census births
data to produce a table that had two records of the total number of births in each year; one from each source.
Here’s the code we used to do it (below is the “full join” version).
Code:
census_births <- births
ssa_births <- babynames %>%
rename(num_births = n) %>%
group_by(year) %>%
summarize(num_rows = n(), births = sum(num_births))
total_births <-
ssa_births %>%
full_join(census_births, by = "year")
head(total_births)
## # A tibble: 6 x 4
## year num_rows births.x births.y
## <dbl> <int> <int> <int>
## 1 1880 2000 201484 NA
## 2 1881 1935 192696 NA
## 3 1882 2127 221533 NA
## 4 1883 2084 216946 NA
## 5 1884 2297 243462 NA
## 6 1885 2294 240854 NA
The births.x
and births.y
variables are not very descriptive; also we don’t care so much about the num_rows
variable, so let’s do some select
ion (to remove num_rows
) and rename
ing (to replace the uninformative names with informative ones).
Code:
total_births <- total_births %>%
select(-num_rows) %>%
rename(ssa = births.x, census = births.y)
head(total_births)
## # A tibble: 6 x 3
## year ssa census
## <dbl> <int> <int>
## 1 1880 201484 NA
## 2 1881 192696 NA
## 3 1882 221533 NA
## 4 1883 216946 NA
## 5 1884 243462 NA
## 6 1885 240854 NA
If we want to visualize the number of births over time from two different sources using two overlaid lines, we have to set the y
aesthetic separately for each line, and if we want different colors, we have to specify them manually, line by line:
Code:
total_births %>%
ggplot(aes(x = year)) +
geom_line(aes(y = census), color = "blue") +
geom_line(aes(y = ssa), color = "orange") +
xlab("Year") + ylab("Total Births")
## Warning: Removed 29 rows containing missing values (geom_path).
We also don’t get an automatic legend.
For a graph like this, we’d like to be able to create an aesthetic mapping between the source of the data and the color of the line. That mapping could then be used to automatically produce a legend. But source
isn’t a variable in this data; it’s distinguished between variables, not between cases.
gather()
functionThinking about what the legend title would be if we created one gives us a clue that we need to wrangle this data into a format conducive to the plot we want.
We need a new variable called something like source
, and a single variable to map to the \(y\)-axis, recording the number of births from the respective source.
We can use gather()
for this, as follows:
Code:
births_long <- total_births %>%
gather(
key = "source", # this is what we're going to call the new variable
value = "births", # this is where the values will go in "long" format
census, ssa # these are the columns we're "merging"
# (we could have also said -year, to merge all
# columns *except* year
)
head(births_long)
## # A tibble: 6 x 3
## year source births
## <dbl> <chr> <int>
## 1 1880 census NA
## 2 1881 census NA
## 3 1882 census NA
## 4 1883 census NA
## 5 1884 census NA
## 6 1885 census NA
Having created the source
variable and having merged all the counts into a single births
variable, we can now create the line graph we want quite easily (and we get a legend automatically, since the color of the line now comes from a variable in the data table)
Code:
## Warning: Removed 29 rows containing missing values (geom_path).
spread()
functionIs the “long” format we’ve created “better” in an absolute sense? Well, it’s better for producing the line graph we wanted, but suppose we wanted to visualize the correlation between the sources with a scatterplot. For a plot like this, we want one axis to be the number of births according to the SSA, and the other axis to be the number of births according to the Census. This was easy in the original data:
Code:
## Warning: Removed 29 rows containing missing values (geom_point).
If the data had come to us in the “long” format, however, it would be much less obvious how to create this plot. It’s also not so clear how we’d do something like compute the correlation, or the difference between the estimates in a particular year (I suppose we could use group_by()
and summarize()
to do this last one, but it wouldn’t be that straightforward).
There may be times when we want to go the other direction: if we want to compute or plot something that depends on ordered pairs (or ordered tuples more generally), such as computing a correlation, or creating a new variable via mutate()
that depends on both entries, then it is probably easier if the coordinates of those pairs (or tuples) are stored in separate variables.
The spread()
function does this:
Code:
## Generates an error
births_wide <- births_long %>%
spread(
key = source, # Unlike with gather(), key= should name an existing variable
value = births # Unlike with gather(), value= should name an existing variable
)
Uh oh, we get an error. What’s that about? It seems that the years from 2002 on have duplicate records in the original births
dataset:
Code:
## # A tibble: 10 x 3
## year source births
## <dbl> <chr> <int>
## 1 2008 census 4247694
## 2 2009 census 4130665
## 3 2010 census 3999386
## 4 2011 census 3953590
## 5 2012 census 3952841
## 6 2013 census 3932181
## 7 2014 census 3988076
## 8 2015 census 3978497
## 9 2016 census 3945875
## 10 2017 census 3855500
For spread()
to work, it needs to know exactly what to put in each new column for each different key (each source
in this case). In order for the “wide” format to be well-defined, there must be a unique mapping from each combination of the other variables (in this case just year
) to a value, for each different key. That’s violated here, and so spread()
doesn’t know what to do.
Since in this case the problem is caused by duplicate entries, we can just remove the duplicates before spread()
ing the data. There are multiple ways to do that; here’s an easy one:
## Use distinct() to find and remove duplicate entries
births_long <- births_long %>%
distinct()
## The problematic rows were the last few years
## from the census; let's look at those
births_long %>%
filter(source == "census") %>%
tail()
## # A tibble: 6 x 3
## year source births
## <dbl> <chr> <int>
## 1 2012 census 3952841
## 2 2013 census 3932181
## 3 2014 census 3988076
## 4 2015 census 3978497
## 5 2016 census 3945875
## 6 2017 census 3855500
OK, let’s try spread()
ing the data again.
## # A tibble: 6 x 3
## year census ssa
## <dbl> <int> <int>
## 1 1880 NA 201484
## 2 1881 NA 192696
## 3 1882 NA 221533
## 4 1883 NA 216946
## 5 1884 NA 243462
## 6 1885 NA 240854
Looks just like the data we started with (except for the order of the columns, and the removal of the duplicate rows)!
Now we can produce a scatterplot… Code:
## Warning: Removed 29 rows containing missing values (geom_point).
(which we could do with the original data, but not with the “long” format data; here we’re just undoing what we did, but in real applications we will sometimes have data come to us in “long” format and need to convert it to “wide”)
We can also compute the correlation: Code:
## The cor() function in mosaic is nicer than the standard one
## since it has a data= argument. The use= argument tells
## cor() how to handle missing data.
mosaic::cor(census ~ ssa, data = births_wide, use = "pairwise.complete")
## Registered S3 method overwritten by 'mosaic':
## method from
## fortify.SpatialPolygonsDataFrame ggplot2
## [1] 0.8945154
or compute and plot the discrepancy: Code:
births_wide %>%
mutate(discrepancy = census - ssa) %>%
ggplot(aes(x = year, y = discrepancy)) +
geom_line() + ylab("Census Estimate - SSA Estimate")
## Warning: Removed 29 rows containing missing values (geom_path).
Even the original data has some features to its organization that might require spread()
ing in order to easily compute or visualize certain things.
For example, recall that the rows in the babynames
dataset correspond to unique combinations of year, name, and sex. If we wanted to find the total number of births associated with a particular name in a particular year irrespective of sex, we have previously needed to use group_by()
and summarize()
like so:
Code:
## Just to reduce the size of the data, we'll filter first by year
babynames1950to2000 <- babynames %>% filter(year >= 1950 & year <= 2000)
babynames1950to2000 %>%
group_by(year, name) %>%
summarize(num_births = sum(n))
## # A tibble: 820,608 x 3
## # Groups: year [51]
## year name num_births
## <dbl> <chr> <int>
## 1 1950 Aaron 805
## 2 1950 Abagail 5
## 3 1950 Abbe 10
## 4 1950 Abbey 7
## 5 1950 Abbie 68
## 6 1950 Abbott 9
## 7 1950 Abby 73
## 8 1950 Abdul 8
## 9 1950 Abe 46
## 10 1950 Abel 170
## # … with 820,598 more rows
which is fine, but suppose we wanted both the total and the number associated with the two recorded sexes in a single table. Previously this involved the somewhat awkward step of using ifelse()
to tally up the number of births for a subset of the data:
Code:
babynames1950to2000 %>%
group_by(year, name) %>%
summarize(
num_males = sum(ifelse(sex == "M", 1, 0)),
num_females = sum(ifelse(sex == "F", 1, 0)),
total_births = sum(n))
## # A tibble: 820,608 x 5
## # Groups: year [51]
## year name num_males num_females total_births
## <dbl> <chr> <dbl> <dbl> <int>
## 1 1950 Aaron 1 1 805
## 2 1950 Abagail 0 1 5
## 3 1950 Abbe 0 1 10
## 4 1950 Abbey 0 1 7
## 5 1950 Abbie 0 1 68
## 6 1950 Abbott 1 0 9
## 7 1950 Abby 1 1 73
## 8 1950 Abdul 1 0 8
## 9 1950 Abe 1 0 46
## 10 1950 Abel 1 0 170
## # … with 820,598 more rows
A more elegant solution to achieve this same thing would be to spread()
the values in the original n
column into two columns: one for births tagged "M"
, and one for births tagged "F"
.
babynames_wide1950to2000 <-
babynames1950to2000 %>%
select(-prop) %>%
spread(
key = sex, # which existing column defines the groups?
value = n, # which existing column contains the values?
fill = 0 # what value to use if no row exists for a particular key combination
# (if omitted will produce NA)
)
head(babynames_wide1950to2000)
## # A tibble: 6 x 4
## year name F M
## <dbl> <chr> <dbl> <dbl>
## 1 1950 Aaron 7 798
## 2 1950 Abagail 5 0
## 3 1950 Abbe 10 0
## 4 1950 Abbey 7 0
## 5 1950 Abbie 68 0
## 6 1950 Abbott 0 9
Note that because the original data also had a prop
column, showing the proportion of births in a particular year,sex combination that had a specific name, using spread()
without first removing this column would not do what we want. This is because spread()
needs to be able to group rows together that have identical values for all variables except the ones specified as the key=
and the value=
. And prop
is different for each sex
; so it would cause the output to have two rows for each name again: one with a value for F
and missing data for M
, and one with the reverse.
By removing prop
before we spread, we ensure that all the non-key
non-value
columns will be identical for the group of rows that is to be consolidated into a single row.
Now that we have F
and M
as separate columns, we can simply use mutate()
to find the total number of births with each name in each year.
Code:
By the way, now we can quite easily produce a measure of the “unisex” quality of a name.
(Make sure you understand what’s going on at each step below: this is all involving verbs you’ve worked with before) Code:
## Let's find the names with the least asymmetry, aggregating over time:
babynames_wide1950to2000 %>%
group_by(name) %>%
summarize(
M = sum(M),
F = sum(F),
total = sum(total)
) %>%
filter(total >= 10000) %>% # exclude very uncommon names
mutate(
prop_male = M/total,
asymmetry = abs(prop_male - 0.50)
) %>%
arrange(asymmetry)
## # A tibble: 1,533 x 6
## name M F total prop_male asymmetry
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Unknown 6949 6967 13916 0.499 0.000647
## 2 Kris 12369 12742 25111 0.493 0.00743
## 3 Carey 12231 11708 23939 0.511 0.0109
## 4 Peyton 9576 10006 19582 0.489 0.0110
## 5 Kerry 43412 46000 89412 0.486 0.0145
## 6 Jaime 50004 47051 97055 0.515 0.0152
## 7 Ashton 12052 13180 25232 0.478 0.0224
## 8 Blair 10351 9215 19566 0.529 0.0290
## 9 Kendall 23721 20757 44478 0.533 0.0333
## 10 Justice 6100 5312 11412 0.535 0.0345
## # … with 1,523 more rows
Find an interesting dataset from the Gapminder repository here and download it first as an Excel spreadsheet (an .xlsx
file). Open it in Excel or a similar program, and export it as a .csv
.
Upload the .csv
file to RStudio (assuming you are working on the server), and read it in using read_csv()
(the function with an underscore is part of the tidyverse, and tends to work better than the built-in one with a period). You will need to supply as the argument to read_csv()
the path to the file relative to the directory where your .Rmd
is.
The cases in the Gapminder datasets are countries. Use rename()
to change the first variable name to country
(since rename()
expects a variable name without quotes, you may need to surround the oriiginal variable name with backticks (the same syntax you use to get a code font in Markdown) if it has spaces or special characters. Pro-tip: never use spaces or special characters (other than underscores) in variable names.
Example:
Use dim()
to find out how many rows and columns are in your data.
We will convert our data to a format with exactly three columns: country
, year
, and value
(whatever value is for your chosen dataset). Before you write any code, sketch on paper what the “tidified” data will look like. Be sure to indicate how many rows it will have.
Use gather()
to convert your data into this format. Use head()
and dim()
to verify that it worked as expected.
The year
variable may be stored as text instead of as a number, which will make mapping it to a spatial dimension challenging. Fix this using mutate()
, with the help of the parse_number()
function (supplied by the readr
package, which is also part of the tidyverse). Type ?parse_number
at the console to see how to use it if it’s not clear.
Plot your variable as a time series line graph, using color to distinguish countries.