To get some focused practice summarizing subsets of data separately using group_by()
together with summarize()
For convenience, the link to the dplyr
reference sheet is here
We’ll continue using the babynames
dataset. We’ll start out exploring the question I asked you to “think about” but not actually answer from the last lab:
For a chosen name, find the year that that name was the most equally split between male and female babies: that is, the year when the sex distribution was closest to 50/50.
For the examples below I’ll look at the name “Jessie”, which bounced around in its dominant gender connotation during the period of this data, but has consistently been at least reasonably popular for both males and females.
One way to answer the question of interest would be to extract the counts for “Jessie” and manually scan the data to see when they are closest to equal.
Code:
## # A tibble: 276 x 5
## year sex name n prop
## <dbl> <chr> <chr> <int> <dbl>
## 1 1880 F Jessie 635 0.00651
## 2 1880 M Jessie 154 0.00130
## 3 1881 F Jessie 661 0.00669
## 4 1881 M Jessie 143 0.00132
## 5 1882 F Jessie 806 0.00697
## 6 1882 M Jessie 192 0.00157
## 7 1883 F Jessie 833 0.00694
## 8 1883 M Jessie 151 0.00134
## 9 1884 F Jessie 888 0.00645
## 10 1884 M Jessie 177 0.00144
## # … with 266 more rows
But this is time consuming and error prone, and requires mental arithmetic, especially if we want our criterion to be percentage based.
We can create a quick plot to estimate the solution visually.
babynames
data, separated by sex. What years (approximately) appear to have the closest to a 50/50 split by sex?It looks like the lines cross somewhere around 1950, and then cross a few more times between 1980 and 2010 or so.
But this still isn’t an exact solution.
What we’d like is to create a variable that actually tells us the relative number of males and females associated with the name “Jessie” in each year.
mutate()
. But that won’t work in this case. Why not?mutate()
operates on one case at a time, but the proportion of males and females requires a comparison involving two rows at the same time
summarize()
to do conditional countingThe quantity we are interested in requires us to aggregate data from multiple cases. This is what summarize()
is good for: take a dataset and apply a function that takes a set of values as input and returns a single value as output. These functions, such as sum()
, mean()
, max()
, and n()
, are called aggregation functions.
For example, I can compute the number of male Jessies born in a particular year (let’s say 1982) by using filter()
to extract the data from 1982, and then using summarize()
together with sum()
, combined with a conditional function like ifelse()
to replace the values I’m not interested in with zeroes. Like so:
Code:
babynames %>%
filter(name == "Jessie" & year == 1982) %>%
mutate(
count_if_males = ifelse(sex == "M", n, 0),
count_if_females = ifelse(sex == "F", n, 0)) %>%
summarize(
total = sum(n),
total_males = sum(count_if_males),
total_females = sum(count_if_females))
## # A tibble: 1 x 3
## total total_males total_females
## <int> <dbl> <dbl>
## 1 2474 1330 1144
The ifelse()
command operates on each entry in a variable, evaluating the condition given in its first argument for that entry, and returning the second argument if the condition is met, and the third if it isn’t. In this case, ifelse(sex == "M", n, 0)
says to look at the sex
column, and if it is equal to "M"
, return the value n
, otherwise return 0. Then, the sum()
function adds up the results.
mutate()
again on the summarized results to get the proportion of babies named Jessie in 1982 that were male.babynames %>%
filter(name == "Jessie" & year == 1982) %>%
mutate(
count_if_males = ifelse(sex == "M", n, 0),
count_if_females = ifelse(sex == "F", n, 0)) %>%
summarize(
total = sum(n),
total_males = sum(count_if_males),
total_female = sum(count_if_females)) %>%
mutate(
prop_males = total_males / total)
## # A tibble: 1 x 4
## total total_males total_female prop_males
## <int> <dbl> <dbl> <dbl>
## 1 2474 1330 1144 0.538
So filter()
together with summarize()
gives us a proportion for a specific year. We could in principle repeat this for each year in the data and see which one comes out closest to 0.5. But this would be tedious, not to mention error-prone.
group_by()
and summarize()
to summarize by groupInstead, we can use group_by()
to “slice” the data by year, and summarize()
each slice:
Code:
jessies_by_year <- babynames %>%
filter(name == "Jessie") %>%
mutate(
count_if_males = ifelse(sex == "M", n, 0),
count_if_females = ifelse(sex == "F", n, 0)) %>%
group_by(year) %>%
summarize(
num_rows = n(), # just a sanity check
total_births = sum(n), # total "Jessies" that year
total_males = sum(count_if_males), # conditional counting
total_females = sum(count_if_females)) %>%
mutate(prop_males = total_males / total_births)
jessies_by_year
## # A tibble: 138 x 6
## year num_rows total_births total_males total_females prop_males
## <dbl> <int> <int> <dbl> <dbl> <dbl>
## 1 1880 2 789 154 635 0.195
## 2 1881 2 804 143 661 0.178
## 3 1882 2 998 192 806 0.192
## 4 1883 2 984 151 833 0.153
## 5 1884 2 1065 177 888 0.166
## 6 1885 2 1154 202 952 0.175
## 7 1886 2 1184 181 1003 0.153
## 8 1887 2 1194 182 1012 0.152
## 9 1888 2 1454 254 1200 0.175
## 10 1889 2 1403 207 1196 0.148
## # … with 128 more rows
Let’s plot the proportion by year, just for fun:
Code:
jessies_by_year %>%
ggplot(aes(x = year, y = prop_males)) +
geom_line() +
geom_hline(
yintercept = 0.5,
lty = 3) +
scale_y_continuous(
name = "% Male",
breaks = seq(from = 0, to = 1, by = 0.1), # Modifying where tick marks and gridlines are drawn
labels = seq(from = 0, to = 100, by = 10)) +
scale_x_continuous(
name = "Year",
breaks = seq(from = 1880, to = 2020, by = 10)) +
ggtitle("Gender Breakdown of the name 'Jessie' in the U.S. over Time")
We can now see with one line what we could see before by seeing when our two lines crossed.
But we’d still like an exact answer to the question “In what year was the proportion closest to 50/50?”
group_by()
and summarize()
comboBefore we tackle that, though, here are some other questions you can answer with the group_by()
and summarize()
one-two punch:
group_by()
and summarize()
together with any other relevant verbs to answer both of these questions in a single pipeline. Your code should return a summarized data frame with two columns: peak_year
and peak_jessies
. The rows should consist of only the year(s) when Jessie hit its all time peak, along with the number of births in that (those) year(s).babynames %>%
filter(name == "Jessie") %>%
group_by(year) %>%
summarize(total_births = sum(n)) %>%
slice_max(order_by = total_births, n = 1) %>%
rename(peak_year = year, peak_jessies = total_births)
min()
and max()
aggregation functions to find the smallest and largest values (respectively) in a set of data. (Warning: the result is pretty big, so it might put some strain the server when everyone is working on it at the same time. Save your files before running this one, and if it hangs, don’t close your browser tab – instead try closing your RStudio project, logging out and then logging in again.)babynames %>%
group_by(name) %>%
summarize(num_rows = n()) %>%
filter(num_rows == 2 * (2017 - 1880 + 1))
Interestingly, some of these (like John and William) have a strong connotation with a particular sex today, but they are such common names overall that a few instances of the opposite sex appear in every single year.
babynames %>%
filter(year >= 1990 & year < 2000) %>%
group_by(name) %>%
summarize(num_births = sum(n)) %>%
slice_max(order_by = num_births, n = 10)
## # A tibble: 10 x 2
## name num_births
## <chr> <int>
## 1 Michael 464249
## 2 Christopher 361251
## 3 Matthew 352341
## 4 Joshua 330046
## 5 Jessica 303854
## 6 Ashley 303125
## 7 Jacob 298926
## 8 Nicholas 275906
## 9 Andrew 273515
## 10 Daniel 273347
pull(DataSet, VariableName)
to extract a column as a free-standing object, and the %in%
operator to check whether an entry is in a set)!Use group_by()
, summarize()
, and ggplot()
together (along with perhaps some other wrangling verbs) to create an interesting and informative visualization on a topic of your choosing (other than baby names). You might want to look back at Lab 2 for links to data sources. Share a code snippet and graphic to the #lab7
channel on Slack.
What did you find particularly interesting/challenging about this lab? Post your response alongside your plot on the #lab7
channel.