STAT 209: Lab5

Five Data-Wrangling Verbs

Goal

Gain practice with the five “fundamental verbs” that are the building blocks in the “grammar of data wrangling”, as implemented in the dplyr package.

The verbs are:

filter()
select()
mutate()
arrange()
summarize()

Resources

You will probably want to look at the reference sheet or the slides from time to time. Remember that knowing how to look things up is an important skill! Nobody memorizes everything.

The Data

We’ll look some more at the babynames dataset for this lab. Let’s make sure it (as well as the tidyverse package) is loaded in our Markdown document.

Code:

library(tidyverse)
library(babynames)
# This is a slightly different dataset than on the last lab
data(babynames) 
glimpse(babynames)

## Observations: 1,924,665
## Variables: 5
## $ year <dbl> 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1880,…
## $ sex  <chr> "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F",…
## $ name <chr> "Mary", "Anna", "Emma", "Elizabeth", "Minnie", "Margaret", …
## $ n    <int> 7065, 2604, 2003, 1939, 1746, 1578, 1472, 1414, 1320, 1288,…
## $ prop <dbl> 0.07238359, 0.02667896, 0.02052149, 0.01986579, 0.01788843,…

Extracting subsets with `filter()`

Recall that filter() allows us to extract a subset of cases from a dataset, by checking for a particular criterion. Let’s extract female “Bella”s (you might be able to do this without peeking at the code below at this point) and display the first few rows with head().

Code:

babynames %>%
  filter(name == "Bella" & sex == "F") %>%
  head()

## # A tibble: 6 x 5
##    year sex   name      n     prop
##   <dbl> <chr> <chr> <int>    <dbl>
## 1  1880 F     Bella    13 0.000133
## 2  1881 F     Bella    24 0.000243
## 3  1882 F     Bella    16 0.000138
## 4  1883 F     Bella    17 0.000142
## 5  1884 F     Bella    31 0.000225
## 6  1885 F     Bella    25 0.000176

Selecting variables with `select()`

Recall that select() allows us to extract certain columns from a dataset, either by listing each variable name we want to include as a separate argument, or by listing each variable name we want to exclude, or by defining a condition for inclusion/exclusion.

Let’s start with the full babynames dataset (so, not just Bellas) and display the first few rows, retaining only the year, name and n variables.

Code:

babynames %>%
  select(year, name, n) %>%
  head()

## # A tibble: 6 x 3
##    year name          n
##   <dbl> <chr>     <int>
## 1  1880 Mary       7065
## 2  1880 Anna       2604
## 3  1880 Emma       2003
## 4  1880 Elizabeth  1939
## 5  1880 Minnie     1746
## 6  1880 Margaret   1578

Now create a new dataset called Bellas that retains just year and n for the first few years of female Bellas, by chaining filter() and select() together with pipes, and assigning the result. Check that the result looks as it should using head().

Code:

## Note that in this case we could filter and select in either order.
## For some wrangling problems, the order will matter, however.
Bellas <- babynames %>%
  filter(name == "Bella" & sex == "F") %>%
  select(year, n)
head(Bellas)

## # A tibble: 6 x 2
##    year     n
##   <dbl> <int>
## 1  1880    13
## 2  1881    24
## 3  1882    16
## 4  1883    17
## 5  1884    31
## 6  1885    25

Defining new variables with `mutate()`

Suppose we want to split the set of name/sex pairs into those that were “popular” in a given year, defined as being a name that was assigned to at least 1% of all babies of a particular sex (as assigned at birth) that year, and those that were not so popular. We can define a new binary variable based on the prop variable using mutate().

Code:

modified_babynames <- babynames %>%
  mutate(popular = prop >= 0.01)
head(modified_babynames)

## # A tibble: 6 x 6
##    year sex   name          n   prop popular
##   <dbl> <chr> <chr>     <int>  <dbl> <lgl>  
## 1  1880 F     Mary       7065 0.0724 TRUE   
## 2  1880 F     Anna       2604 0.0267 TRUE   
## 3  1880 F     Emma       2003 0.0205 TRUE   
## 4  1880 F     Elizabeth  1939 0.0199 TRUE   
## 5  1880 F     Minnie     1746 0.0179 TRUE   
## 6  1880 F     Margaret   1578 0.0162 TRUE

If we decide the name a variable is given, we can replace it using the rename() function. For example, let’s rename popular to is_popular. This function has the following syntax: rename(dataset, newname1 = oldname1, newname2 = oldname2, ...)

Code

modified_babynames <- modified_babynames %>%
  rename(is_popular = popular)
head(modified_babynames)

## # A tibble: 6 x 6
##    year sex   name          n   prop is_popular
##   <dbl> <chr> <chr>     <int>  <dbl> <lgl>     
## 1  1880 F     Mary       7065 0.0724 TRUE      
## 2  1880 F     Anna       2604 0.0267 TRUE      
## 3  1880 F     Emma       2003 0.0205 TRUE      
## 4  1880 F     Elizabeth  1939 0.0199 TRUE      
## 5  1880 F     Minnie     1746 0.0179 TRUE      
## 6  1880 F     Margaret   1578 0.0162 TRUE

Create a dataset called PopularBabynames that includes only those names that were “popular” in the given year. Use the new is_popular variable to do the filtering, and then remove the variable from the filtered dataset since it is now a “constant”.

Solution:

PopularBabynames <- modified_babynames %>%
  filter(is_popular == TRUE) %>%
  select(-is_popular)
head(PopularBabynames)

## # A tibble: 6 x 5
##    year sex   name          n   prop
##   <dbl> <chr> <chr>     <int>  <dbl>
## 1  1880 F     Mary       7065 0.0724
## 2  1880 F     Anna       2604 0.0267
## 3  1880 F     Emma       2003 0.0205
## 4  1880 F     Elizabeth  1939 0.0199
## 5  1880 F     Minnie     1746 0.0179
## 6  1880 F     Margaret   1578 0.0162

Sorting data with `arrange()`

We can easily see at what point the largest share of births went to a single name by sorting the dataset by prop. We can use arrange() for this. To arrange in descending order so that the most popular name is at the top, use the desc() helper function around the variable name.

Code:

babynames %>%
  arrange(desc(prop)) %>%
  head()

## # A tibble: 6 x 5
##    year sex   name        n   prop
##   <dbl> <chr> <chr>   <int>  <dbl>
## 1  1880 M     John     9655 0.0815
## 2  1881 M     John     8769 0.0810
## 3  1880 M     William  9532 0.0805
## 4  1883 M     John     8894 0.0791
## 5  1881 M     William  8524 0.0787
## 6  1882 M     John     9557 0.0783

Describe precisely what the prop variable is telling us here. What does it mean for a name to be “first” in this list?

Solution:

## The prop variable tells us the proportion of babies with a particular
## sex assigned at birth that had the name in question.
## So by sorting on that variable we see the most popular names 
## for each sex (as a share of births for that sex)

Find the most popular male and female name in your birth year.

Possible Solution:

babynames %>%
  filter(year == 1982 & sex == "M") %>%
  arrange(desc(prop))

## # A tibble: 7,364 x 5
##     year sex   name            n   prop
##    <dbl> <chr> <chr>       <int>  <dbl>
##  1  1982 M     Michael     68228 0.0362
##  2  1982 M     Christopher 59225 0.0314
##  3  1982 M     Matthew     46060 0.0244
##  4  1982 M     Jason       40624 0.0215
##  5  1982 M     David       40451 0.0214
##  6  1982 M     James       38872 0.0206
##  7  1982 M     Joshua      38027 0.0202
##  8  1982 M     John        34699 0.0184
##  9  1982 M     Robert      34421 0.0182
## 10  1982 M     Daniel      32653 0.0173
## # … with 7,354 more rows

## For males born in 1982, the most popular name was "Michael"
## (similar for females)

Choose a name, sex pair you like (perhaps your name and sex you identify with, if you identify with one in this dataset). Find the birth year when that name was most popular for babies assigned that sex.

Possible Solution:

babynames %>%
  filter(name == "Colin", sex == "M") %>%
  arrange(desc(prop))

## # A tibble: 127 x 5
##     year sex   name      n    prop
##    <dbl> <chr> <chr> <int>   <dbl>
##  1  2004 M     Colin  5122 0.00242
##  2  2003 M     Colin  4876 0.00232
##  3  2005 M     Colin  4531 0.00213
##  4  2006 M     Colin  3858 0.00176
##  5  2009 M     Colin  3655 0.00172
##  6  2008 M     Colin  3728 0.00171
##  7  2010 M     Colin  3486 0.00170
##  8  2007 M     Colin  3608 0.00163
##  9  2011 M     Colin  3265 0.00161
## 10  2002 M     Colin  3315 0.00160
## # … with 117 more rows

## Colin hit its peak for males in 2004, at about 1/4 of 1%

Find the birth year in which the greatest number of babies were born with your name and sex (or name and sex of choice).
Explain why it could be different than the year in the last question.

Possible Solution:

babynames %>%
  filter(name == "Colin", sex == "M") %>%
  arrange(desc(n))

## # A tibble: 127 x 5
##     year sex   name      n    prop
##    <dbl> <chr> <chr> <int>   <dbl>
##  1  2004 M     Colin  5122 0.00242
##  2  2003 M     Colin  4876 0.00232
##  3  2005 M     Colin  4531 0.00213
##  4  2006 M     Colin  3858 0.00176
##  5  2008 M     Colin  3728 0.00171
##  6  2009 M     Colin  3655 0.00172
##  7  2007 M     Colin  3608 0.00163
##  8  2010 M     Colin  3486 0.00170
##  9  2002 M     Colin  3315 0.00160
## 10  2011 M     Colin  3265 0.00161
## # … with 117 more rows

## In my case it's the same, but since the number of births is changing
## over time, it might be that a name is relatively less popular 
## in a year with more births of a particular sex 
## than in a year with fewer births for that sex, and so the 
## proportion is lower, but the absolute number is higher because
## of the higher total number of births that year

Think about how you might address the following question, without actually trying to do it now: In which year was a particular name (pick any name) the most balanced between males and females; that is, when was the number of male and female births for that name closest to a 50/50 split?

Possible Solution:

## We would need to create a variable that tells us the proportion
## of babies with a particular name who are assigned each sex 
## (say, propsex), and then sort the data for that variable 
## from the "middle out".
## Probably the easiest way to do that if we want to use arrange()
## is to define a second variable equal to, say the absolute value
## of the difference between propsex and 0.5, and sort descending on that

Calculating summary statistics with `summarize()`

The summarize() verb works a little bit differently than the other four verbs. Whereas filter(), select(), mutate(), and arrange() take in a dataset where the rows are cases and the columns are variables and return a dataset in the same form, summarize() takes a dataset where the rows are cases and the columns are variables and returns a dataset with just one row (at least, when it is used by itself), where the columns are summary statistics (things like means, standard deviations, etc.) calculated from all the cases in the input.

Tip: When using summarize(), it is almost always desirable to return as one of the summary statistics the number of cases in the set being summarized. Among other things, this can be a quick way to alert you to errors. The n() function (called with no arguments) is a special helper function that does this.

NB: The babynames data contains a variable called n. Don’t confuse this variable with the function n(). In fact, to prevent confusion, rename the n variable to num_births

Code

modified_babynames <- modified_babynames %>%
  rename(num_births = n)
head(modified_babynames)

## # A tibble: 6 x 6
##    year sex   name      num_births   prop is_popular
##   <dbl> <chr> <chr>          <int>  <dbl> <lgl>     
## 1  1880 F     Mary            7065 0.0724 TRUE      
## 2  1880 F     Anna            2604 0.0267 TRUE      
## 3  1880 F     Emma            2003 0.0205 TRUE      
## 4  1880 F     Elizabeth       1939 0.0199 TRUE      
## 5  1880 F     Minnie          1746 0.0179 TRUE      
## 6  1880 F     Margaret        1578 0.0162 TRUE

Suppose we want to find the year in which the name “Bella” hit its peak for females. We can take the Bellas dataset we created above, and use summarize() together with the max() function to answer this question.

Code:

Bellas <- modified_babynames %>%
  filter(name == "Bella" & sex == "F")
Bellas %>%
  summarize(
    num_rows = n(), 
    most_bellas = max(num_births))

## # A tibble: 1 x 2
##   num_rows most_bellas
##      <int>       <int>
## 1      138        5121

Hmm… This tells us how many female Bellas were born in that name’s peak year for females, but it doesn’t actually tell us the year when that happened. We can use the which.max() function to get the relevant row number, and pull out the year in that position using square bracket element selection syntax, as follows.

Bellas %>%
  summarize(
    num_rows    = n(), 
    peak_number = max(num_births),
    peak_year   = year[which.max(num_births)])

## # A tibble: 1 x 3
##   num_rows peak_number peak_year
##      <int>       <int>     <dbl>
## 1      138        5121      2010

Explain what the num_rows value tells us in context.

Possible Solution:

## Since each row is a year,name,sex combination, and since we have
## filtered to include only rows where name == "Bella" and sex == "F",
## the number of rows here tells us the number of years 
## in the dataset for which we 
## have data about the name "Bella"

Use summarize() instead of arrange() to redo exercises 3 and 4.

Possible Solution:

## Exercise 3
babynames %>%
  filter(year == 1982 & sex == "M") %>%
  summarize(
    num_rows          = n(),
    peak_prop         = max(prop),
    most_popular_name = name[which.max(prop)])

## # A tibble: 1 x 3
##   num_rows peak_prop most_popular_name
##      <int>     <dbl> <chr>            
## 1     7364    0.0362 Michael

## Exercise 4
babynames %>%
  filter(name == "Colin" & sex == "M") %>%
  summarize(
    num_rows          = n(),
    peak_prop         = max(prop),
    peak_year         = year[which.max(prop)])

## # A tibble: 1 x 3
##   num_rows peak_prop peak_year
##      <int>     <dbl>     <dbl>
## 1      127   0.00242      2004

Piping Practice

And now for the cake-decorating portion of the lab. Just kidding.

Recall that when we write

dataset %>% verb(arguments)

this is equivalent to writing

verb(dataset, arguments)

More generally,

some_function(main_argument, other_arguments)

is rewritten as

main_argument %>% some_function(other_arguments)

With just one function it’s not clear that the pipe syntax is any clearer, but when we start chaining operations together, writing the verbs from left to right instead of from inside out (which is how we’d have to do it without the pipe) makes the code a whole lot easier to read.

Re-write the following mess of a command, which displays a list of the top ten years in which male Colins were born in descending order of the number of male babies named Colin that year, using the pipe operator.

head(
  select(
    arrange(
      filter(
        modified_babynames, name == "Colin" & sex == "M"),
      desc(num_births)),
    year, num_births),
  n = 10)

## # A tibble: 10 x 2
##     year num_births
##    <dbl>      <int>
##  1  2004       5122
##  2  2003       4876
##  3  2005       4531
##  4  2006       3858
##  5  2008       3728
##  6  2009       3655
##  7  2007       3608
##  8  2010       3486
##  9  2002       3315
## 10  2011       3265

Possible solution:

modified_babynames %>%
  filter(name == "Colin" & sex == "M") %>%
  arrange(desc(num_births)) %>%
  select(year, num_births) %>%
  head()

Getting Credit

Send a DM to me and Chris Ikeokwu in Slack with the following: * Your code to find the peak birth year for your (chosen) name in two ways: one using arrange() and one using summarize(). * What aspects of the data-wrangling we’ve done so far do you feel most comfortable with after doing this lab? * What aspects do you feel least comfortable with? * Write the honor pledge, signifying that you completed the whole lab (and not just the parts where you had to turn in something)

STAT 209: Lab5

Five Data-Wrangling Verbs

Goal

Resources

The Data

Extracting subsets with filter()

Selecting variables with select()

Defining new variables with mutate()

Sorting data with arrange()

Calculating summary statistics with summarize()

Piping Practice

Getting Credit

Extracting subsets with `filter()`

Selecting variables with `select()`

Defining new variables with `mutate()`

Sorting data with `arrange()`

Calculating summary statistics with `summarize()`