Aggregating Data by Group

Goal

To practice summarizing subsets of data separately using group_by() together with summarize()

Resources

For convenience, the link to the dplyr reference sheet is here

The Data

We’ll continue some more with the babynames dataset. We’ll start out exploring the question I asked you to “think about” but not actually answer from the last lab:

  1. For a chosen name, find the year that that name was the most equally split between male and female babies: that is, the year when the sex distribution was closest to 50/50.

Preliminaries (loading packages and data):

library(tidyverse)
library(babynames)

For the examples below I’ll look at the name “Jessie”, which bounced around in its gender connotation during the period of this data, but has consistently been at least reasonably popular for both males and females.

One way to answer the question of interest would be to extract the counts for “Jessie” and manually scan the data to see when they are closest to equal.

Code:

babynames %>%
  filter(name == "Jessie")
## # A tibble: 272 x 5
##     year   sex   name     n        prop
##    <dbl> <chr>  <chr> <int>       <dbl>
##  1  1880     F Jessie   635 0.006505881
##  2  1880     M Jessie   154 0.001300687
##  3  1881     F Jessie   661 0.006686561
##  4  1881     M Jessie   143 0.001320626
##  5  1882     F Jessie   806 0.006966533
##  6  1882     M Jessie   192 0.001573371
##  7  1883     F Jessie   833 0.006938255
##  8  1883     M Jessie   151 0.001342485
##  9  1884     F Jessie   888 0.006454145
## 10  1884     M Jessie   177 0.001442084
## # ... with 262 more rows

But this is time consuming and requires mental arithmetic.

We can create a quick plot to estimate the solution visually:

Code:

babynames %>% 
  filter(name == "Jessie") %>%
  ggplot(aes(x = year, y = n, color = sex)) +
  geom_line()

It looks like the lines cross somewhere around 1950, and then cross a few more times between 1980 and 2010 or so. But this still isn’t an exact solution.

What we’d like is to create a variable that actually tells us the relative number of males and females associated with the name “Jessie” in each year.

  1. Often when we want to create a new variable, we can use mutate(). But that won’t work in this case. Why not?

Explanation:

# mutate() operates on one case at a time, but the proportion
# of males and females requires a comparison involving two 
# rows at the same time

The quantity we are interested in requires us to aggregate data from multiple cases. This is what summarize() is good for: take a dataset and apply a function that takes a set of values as input and returns a single value as output. These functions, such as sum(), mean(), max(), and n(), are called aggregate functions.

For example, I can compute the number of male Jessies born in a particular year (let’s say 1982) by using filter() to extract the data from 1982, and then using summarize() together with sum(), combined with a conditional function like ifelse() to replace the values I’m not interested in with zeroes. Like so:

Code:

babynames %>%
  filter(name == "Jessie" & year == 1982) %>%
  summarize(
    total = sum(n),
    num_males = sum(ifelse(sex == "M", n, 0)))
## # A tibble: 1 x 2
##   total num_males
##   <int>     <dbl>
## 1  2474      1330

The ifelse() command operates on each entry in a variable, returning a list of values the same length as the input. In this case, ifelse(sex == "M", n, 0) says to look at the sex column, and for each entry if it is equal to "M", place the value of n in that same position, otherwise put zero. Then, the sum() function adds up the results.

  1. Take the command above and use mutate() to get the proportion of babies named Jessies in 1982 that were male.

So filter() together with summarize() gives us a proportion for a specific year. We could in principle repeat this for each year in the data and see which one comes out closest to 0.5. But this would be tedious, not to mention error-prone.

Instead, we can use group_by() to “slice” the data by year, and summarize() each slice:

Code:

jessies_by_year <-
  babynames %>%
    filter(name == "Jessie") %>%
    group_by(year) %>%
    summarize(
      num_rows = n(),        # just a sanity check
      total_births = sum(n), # total "Jessies" that year
      num_males = sum(ifelse(sex == "M", n, 0)),
      num_females = sum(ifelse(sex == "F", n, 0))) %>%
    mutate(prop_males = num_males / total_births)
jessies_by_year
## # A tibble: 136 x 6
##     year num_rows total_births num_males num_females prop_males
##    <dbl>    <int>        <int>     <dbl>       <dbl>      <dbl>
##  1  1880        2          789       154         635  0.1951838
##  2  1881        2          804       143         661  0.1778607
##  3  1882        2          998       192         806  0.1923848
##  4  1883        2          984       151         833  0.1534553
##  5  1884        2         1065       177         888  0.1661972
##  6  1885        2         1154       202         952  0.1750433
##  7  1886        2         1184       181        1003  0.1528716
##  8  1887        2         1194       182        1012  0.1524288
##  9  1888        2         1454       254        1200  0.1746905
## 10  1889        2         1403       207        1196  0.1475410
## # ... with 126 more rows

Let’s plot the proportion by year, just for fun:

Code:

jessies_by_year %>%
  ggplot(aes(x = year, y = prop_males)) +
  geom_line() +
  geom_hline(yintercept = 0.5)

We can now see with one line what we could see before by seeing when our two lines crossed.

But we’d still like an exact answer to the question “In what year was the proportion closest to 50/50?”

Before that, though, here are some other questions you can answer with the group_by() and summarize() one-two punch:

  1. In what year was the largest total number of Jessies born (combining sexes)? How many Jessies were born that year?

Sample solution:

### Version using arrange()
babynames %>%
  filter(name == "Jessie") %>%
  group_by(year) %>%
  summarize(total_births = sum(n)) %>%
  arrange(desc(total_births)) %>%
  head(n = 1)

## Version using a second summarize()
babynames %>%
  filter(name == "Jessie") %>%
  group_by(year) %>%
  summarize(total_births = sum(n)) %>%
  summarize(
    peak_number = max(total_births),
    peak_year = year[which.max(total_births)])
  1. Create a summary table that shows, for each name, the first year it appears in the data and the last year it appears in the data.

Sample solution:

babynames %>%
  group_by(name) %>%
  summarize(
    first_year = min(year),
    last_year = max(year))
  1. There are 16 names that were assigned to babies of both sexes in every year from 1880 to 2015. List them. Hint: if a name has zero births for a particular sex in a particular year, there is no entry in the data for that year/sex/name combination.

Sample solution

## Names that appear for both sexes, and only those names, 
## will have 2 x (2015 - 1880 + 1) entries in the table
babynames %>%
  group_by(name) %>%
  summarize(num_rows = n()) %>%
  filter(num_rows == 2 * (2015 - 1880 + 1))

Interestingly, some of these (like John and William) have a strong connotation with a particular sex today, but they are such common names overall that a few instances of the opposite sex appear.

  1. Write a pipeline to return the 10 most common names (combining sexes) of the 1990s.

Sample solution:

babynames %>%
  filter(year >= 1990 & year < 2000) %>%
  group_by(name) %>%
  summarize(num_births = sum(n)) %>%
  arrange(desc(num_births)) %>%
  head(n = 10)
## # A tibble: 10 x 2
##           name num_births
##          <chr>      <int>
##  1     Michael     464221
##  2 Christopher     361234
##  3     Matthew     352320
##  4      Joshua     330029
##  5     Jessica     303837
##  6      Ashley     303113
##  7       Jacob     298871
##  8    Nicholas     275899
##  9      Andrew     273475
## 10      Daniel     273274
  1. This one is hard: Find the names that were “popular” in your birth year (defined as being assigned to 1% or more of all births, irrespective of sex), and determine which one made its first appearance latest. That is, what’s the “youngest” popular name for your birth year? (This (I think) requires more than one “pipeline”: one to get a list of popular names in your birth year and another to extract the earliest appearance of each name and sort or summarize. You will probably need to use the DataSet$VariableName syntax to extract a column, and the %in operator to check whether an entry is in a set)!

Sample solution:

popular_names_1982 <- 
  babynames %>%
  filter(year == 1982) %>%
  group_by(name) %>%
  summarize(total_for_name = sum(n)) %>%
  mutate(overall_total = sum(total_for_name)) %>%
  mutate(proportion_of_total = total_for_name / overall_total) %>%
  filter(proportion_of_total >= 0.01) %>%
  select(name)
popular_names_1982
babynames %>%
  filter(name %in% popular_names_1982$name) %>%
  group_by(name) %>%
  summarize(first_year = min(year)) %>%
  summarize(
    youngest_name = name[which.max(first_year)],
    first_appearance = max(first_year))

In 1982, Jennifer was a popular name, and there were (theoretically) no U.S. born Jennifers who at the time were older than 66.

  1. Create and plot a variable that represents the “name diversity” in each year, defined as the number of distinct names that appear in the dataset. If you are so inclined, create additional variables that compute this for only males and for only females.

Sample solution

babynames %>%
  group_by(year) %>%
  summarize(
    total_names = n(),
    num_male_names = sum(ifelse(sex == "M", 1, 0)),
    num_female_names = sum(ifelse(sex == "F", 1, 0))) %>%
  ggplot(aes(x = year)) +
    geom_line(aes(y = total_names), color = "green") +
    geom_line(aes(y = num_male_names), color = "blue") +
    geom_line(aes(y = num_female_names), color = "red") +
    ylab("Number of different names")