To practice summarizing subsets of data separately using group_by()
together with summarize()
For convenience, the link to the dplyr
reference sheet is here
We’ll continue some more with the babynames
dataset. We’ll start out exploring the question I asked you to “think about” but not actually answer from the last lab:
For the examples below I’ll look at the name “Jessie”, which bounced around in its gender connotation during the period of this data, but has consistently been at least reasonably popular for both males and females.
One way to answer the question of interest would be to extract the counts for “Jessie” and manually scan the data to see when they are closest to equal.
Code:
## # A tibble: 276 x 5
## year sex name n prop
## <dbl> <chr> <chr> <int> <dbl>
## 1 1880 F Jessie 635 0.00651
## 2 1880 M Jessie 154 0.00130
## 3 1881 F Jessie 661 0.00669
## 4 1881 M Jessie 143 0.00132
## 5 1882 F Jessie 806 0.00697
## 6 1882 M Jessie 192 0.00157
## 7 1883 F Jessie 833 0.00694
## 8 1883 M Jessie 151 0.00134
## 9 1884 F Jessie 888 0.00645
## 10 1884 M Jessie 177 0.00144
## # … with 266 more rows
But this is time consuming and requires mental arithmetic.
We can create a quick plot to estimate the solution visually:
Code:
It looks like the lines cross somewhere around 1950, and then cross a few more times between 1980 and 2010 or so. But this still isn’t an exact solution.
What we’d like is to create a variable that actually tells us the relative number of males and females associated with the name “Jessie” in each year.
mutate()
. But that won’t work in this case. Why not?Explanation:
summarize()
to do conditional countingThe quantity we are interested in requires us to aggregate data from multiple cases. This is what summarize()
is good for: take a dataset and apply a function that takes a set of values as input and returns a single value as output. These functions, such as sum()
, mean()
, max()
, and n()
, are called aggregate functions.
For example, I can compute the number of male Jessies born in a particular year (let’s say 1982) by using filter()
to extract the data from 1982, and then using summarize()
together with sum()
, combined with a conditional function like ifelse()
to replace the values I’m not interested in with zeroes. Like so:
Code:
babynames %>%
filter(name == "Jessie" & year == 1982) %>%
summarize(
total = sum(n),
num_males = sum(ifelse(sex == "M", n, 0)))
## # A tibble: 1 x 2
## total num_males
## <int> <dbl>
## 1 2474 1330
The ifelse()
command operates on each entry in a variable, returning a list of values the same length as the input. In this case, ifelse(sex == "M", n, 0)
says to look at the sex
column, and for each entry if it is equal to "M"
, place the value of n
in that same position, otherwise put zero. Then, the sum()
function adds up the results.
mutate()
to get the proportion of babies named Jessies in 1982 that were male.Code:
babynames %>%
filter(name == "Jessie" & year == 1982) %>%
summarize(
total = sum(n),
num_males = sum(ifelse(sex == "M", n, 0))) %>%
mutate(prop_males = num_males / total)
## # A tibble: 1 x 3
## total num_males prop_males
## <int> <dbl> <dbl>
## 1 2474 1330 0.538
So filter()
together with summarize()
gives us a proportion for a specific year. We could in principle repeat this for each year in the data and see which one comes out closest to 0.5. But this would be tedious, not to mention error-prone.
group_by()
and summarize()
to summarize by groupInstead, we can use group_by()
to “slice” the data by year, and summarize()
each slice:
Code:
jessies_by_year <-
babynames %>%
filter(name == "Jessie") %>%
group_by(year) %>%
summarize(
num_rows = n(), # just a sanity check
total_births = sum(n), # total "Jessies" that year
num_males = sum(ifelse(sex == "M", n, 0)), # conditional counting
num_females = sum(ifelse(sex == "F", n, 0))) %>%
mutate(prop_males = num_males / total_births)
jessies_by_year
## # A tibble: 138 x 6
## year num_rows total_births num_males num_females prop_males
## <dbl> <int> <int> <dbl> <dbl> <dbl>
## 1 1880 2 789 154 635 0.195
## 2 1881 2 804 143 661 0.178
## 3 1882 2 998 192 806 0.192
## 4 1883 2 984 151 833 0.153
## 5 1884 2 1065 177 888 0.166
## 6 1885 2 1154 202 952 0.175
## 7 1886 2 1184 181 1003 0.153
## 8 1887 2 1194 182 1012 0.152
## 9 1888 2 1454 254 1200 0.175
## 10 1889 2 1403 207 1196 0.148
## # … with 128 more rows
Let’s plot the proportion by year, just for fun:
Code:
jessies_by_year %>%
ggplot(aes(x = year, y = prop_males)) +
geom_line() +
geom_hline(yintercept = 0.5, lty = 3) +
scale_y_continuous(name = "Proportion Male") +
scale_x_continuous(name = "Year", breaks = seq(1880, 2020, by = 10)) +
ggtitle("Gender Breakdown of the name 'Jessie' in the U.S. over Time")
We can now see with one line what we could see before by seeing when our two lines crossed.
But we’d still like an exact answer to the question “In what year was the proportion closest to 50/50?”
summarize()
ing a summaryBefore that, though, here are some other questions you can answer with the group_by()
and summarize()
one-two punch:
summarize()
as just another data set, and use summarize()
a second time to find the year with the largest total)Sample solution:
### Version using arrange()
babynames %>%
filter(name == "Jessie") %>%
group_by(year) %>%
summarize(total_births = sum(n)) %>%
arrange(desc(total_births)) %>%
head(n = 1)
## Version using a second summarize()
babynames %>%
filter(name == "Jessie") %>%
group_by(year) %>%
summarize(total_births = sum(n)) %>%
summarize(
peak_number = max(total_births),
peak_year = year[which.max(total_births)])
group_by()
and summary()
Sample solution:
Sample solution
babynames %>%
group_by(name) %>%
summarize(num_rows = n()) %>%
filter(num_rows == 2 * (2017 - 1880 + 1))
Interestingly, some of these (like John and William) have a strong connotation with a particular sex today, but they are such common names overall that a few instances of the opposite sex appear.
Sample solution:
babynames %>%
filter(year >= 1990 & year < 2000) %>%
group_by(name) %>%
summarize(num_births = sum(n)) %>%
arrange(desc(num_births)) %>%
head(n = 10)
## # A tibble: 10 x 2
## name num_births
## <chr> <int>
## 1 Michael 464249
## 2 Christopher 361251
## 3 Matthew 352341
## 4 Joshua 330046
## 5 Jessica 303854
## 6 Ashley 303125
## 7 Jacob 298926
## 8 Nicholas 275906
## 9 Andrew 273515
## 10 Daniel 273347
pull(DataSet, VariableName)
to extract a column as a free-standing object, and the %in%
operator to check whether an entry is in a set)!Sample solution:
popular_names_1982 <-
babynames %>%
filter(year == 1982) %>%
group_by(name) %>%
summarize(total_for_name = sum(n)) %>%
mutate(
overall_total = sum(total_for_name),
proportion_of_total = total_for_name / overall_total) %>%
filter(proportion_of_total >= 0.01) %>%
select(name)
popular_names_1982
babynames %>%
filter(name %in% pull(popular_names_1982, name)) %>%
group_by(name) %>%
summarize(first_year = min(year)) %>%
summarize(
youngest_name = name[which.max(first_year)],
first_appearance = max(first_year))
In 1982, Jennifer was a popular name, and there were (theoretically) no U.S. born Jennifers who at the time were older than 66.
Sample solution
babynames %>%
group_by(year) %>%
summarize(
total_names = n(),
num_male_names = sum(ifelse(sex == "M", 1, 0)),
num_female_names = sum(ifelse(sex == "F", 1, 0))) %>%
ggplot(aes(x = year)) +
geom_line(aes(y = total_names, color = "Total")) +
geom_line(aes(y = num_male_names, color = "Male")) +
geom_line(aes(y = num_female_names, color = "Female")) +
ylab("Number of different names") +
scale_color_manual(
name = "Sex",
values = c(Total = 1, Male = 2, Female = 3))
Use group_by()
, summarize()
, and ggplot()
together (along with perhaps some other wrangling verbs) to create an interesting and informative visualization on a topic of your choosing (other than baby names). You might want to look back at Lab 2 for links to data sources. Share a code snippet and graphic to the #lab6
channel on Slack.
What did you find particularly interesting/challenging about this lab? Post your response alongside your plot on the #lab6
channel.