Five Data-Wrangling Verbs

Goal

Gain practice with the five “fundamental verbs” that are the building blocks in the “grammar of data wrangling”, as implemented in the dplyr package.

The verbs are:

  • filter()
  • select()
  • mutate()
  • arrange()
  • summarize()

Resources

You will probably want to look at the reference sheet from time to time. Remember that knowing how to look things up is an important skill! Nobody memorizes everything.

The Data

We’ll look some more at the babynames dataset for this lab. Let’s make sure it (as well as the tidyverse package) is loaded in our Markdown document.

Code:

library(tidyverse)
library(babynames)
# This is a slightly different dataset than on the last lab
data(babynames) 
glimpse(babynames)
## Observations: 1,858,689
## Variables: 5
## $ year <dbl> 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1880, 188...
## $ sex  <chr> "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F...
## $ name <chr> "Mary", "Anna", "Emma", "Elizabeth", "Minnie", "Margaret"...
## $ n    <int> 7065, 2604, 2003, 1939, 1746, 1578, 1472, 1414, 1320, 128...
## $ prop <dbl> 0.072384329, 0.026679234, 0.020521700, 0.019865989, 0.017...

Extracting subsets with filter()

Recall that filter() allows us to extract a subset of cases from a dataset, by checking for a particular criterion. Let’s extract female “Bella”s (you might be able to do this without peeking at the code below at this point) and display the first few rows with head().

Code:

babynames %>%
  filter(name == "Bella" & sex == "F") %>%
  head()
## # A tibble: 6 x 5
##    year   sex  name     n         prop
##   <dbl> <chr> <chr> <int>        <dbl>
## 1  1880     F Bella    13 0.0001331913
## 2  1881     F Bella    24 0.0002427798
## 3  1882     F Bella    16 0.0001382935
## 4  1883     F Bella    17 0.0001415970
## 5  1884     F Bella    31 0.0002253136
## 6  1885     F Bella    25 0.0001761196

Selecting variables with select()

Recall that select() allows us to extract certain columns from a dataset, either by listing each variable name we want to include as a separate argument, or by listing each variable name we want to exclude, or by defining a condition for inclusion/exclusion.

Let’s start with the full babynames dataset (so, not just Bellas) and display the first few rows, retaining only the year, name and n variables.

Code:

babynames %>%
  select(year, name, n) %>%
  head()
## # A tibble: 6 x 3
##    year      name     n
##   <dbl>     <chr> <int>
## 1  1880      Mary  7065
## 2  1880      Anna  2604
## 3  1880      Emma  2003
## 4  1880 Elizabeth  1939
## 5  1880    Minnie  1746
## 6  1880  Margaret  1578

Now create a new dataset called Bellas that retains these same columns for the first few years of female Bellas, by chaining filter() and select() together, and assigning the result. Check that the result looks as it should using head().

Code:

## Note that in this case we could filter and select in either order.
## For some wrangling problems, the order will matter, however.
Bellas <- babynames %>%
  filter(name == "Bella" & sex == "F") %>%
  select(year, name, sex, n)
head(Bellas)
## # A tibble: 6 x 4
##    year  name   sex     n
##   <dbl> <chr> <chr> <int>
## 1  1880 Bella     F    13
## 2  1881 Bella     F    24
## 3  1882 Bella     F    16
## 4  1883 Bella     F    17
## 5  1884 Bella     F    31
## 6  1885 Bella     F    25

Defining new variables with mutate()

Suppose we want to split the set of names into those that were “popular” in a given year, defined as being assigned to at least 1% of babies that year, and those that were not so popular. We can define a new binary variable based on the prop variable using mutate().

Code:

modified_babynames <- babynames %>%
  mutate(popular = prop >= 0.01)
head(modified_babynames)
## # A tibble: 6 x 6
##    year   sex      name     n       prop popular
##   <dbl> <chr>     <chr> <int>      <dbl>   <lgl>
## 1  1880     F      Mary  7065 0.07238433    TRUE
## 2  1880     F      Anna  2604 0.02667923    TRUE
## 3  1880     F      Emma  2003 0.02052170    TRUE
## 4  1880     F Elizabeth  1939 0.01986599    TRUE
## 5  1880     F    Minnie  1746 0.01788861    TRUE
## 6  1880     F  Margaret  1578 0.01616737    TRUE

If we decide the name a variable is given, we can replace it using the rename() function. For example, let’s rename popular to is_popular. This function has the following syntax: rename(dataset, newname1 = oldname1, newname2 = oldname2, ...)

Code

modified_babynames <- modified_babynames %>%
  rename(is_popular = popular)
head(modified_babynames)
## # A tibble: 6 x 6
##    year   sex      name     n       prop is_popular
##   <dbl> <chr>     <chr> <int>      <dbl>      <lgl>
## 1  1880     F      Mary  7065 0.07238433       TRUE
## 2  1880     F      Anna  2604 0.02667923       TRUE
## 3  1880     F      Emma  2003 0.02052170       TRUE
## 4  1880     F Elizabeth  1939 0.01986599       TRUE
## 5  1880     F    Minnie  1746 0.01788861       TRUE
## 6  1880     F  Margaret  1578 0.01616737       TRUE
  1. Create a dataset called PopularBabynames that includes only those names that were “popular” in the given year. Use the new is_popular variable to do the filtering, and then remove the variable from the filtered dataset since it is now a “constant”.

Solution:

PopularBabynames <- modified_babynames %>%
  filter(is_popular == TRUE) %>%
  select(-is_popular)
head(PopularBabynames)
## # A tibble: 6 x 5
##    year   sex      name     n       prop
##   <dbl> <chr>     <chr> <int>      <dbl>
## 1  1880     F      Mary  7065 0.07238433
## 2  1880     F      Anna  2604 0.02667923
## 3  1880     F      Emma  2003 0.02052170
## 4  1880     F Elizabeth  1939 0.01986599
## 5  1880     F    Minnie  1746 0.01788861
## 6  1880     F  Margaret  1578 0.01616737

Sorting data with arrange()

We can easily see at what point the largest share of births went to a single name by sorting the dataset by prop. We can use arrange() for this. To arrange in descending order so that the most popular name is at the top, use the desc() helper function around the variable name.

Code:

babynames %>%
  arrange(desc(prop)) %>%
  head()
## # A tibble: 6 x 5
##    year   sex    name     n       prop
##   <dbl> <chr>   <chr> <int>      <dbl>
## 1  1880     M    John  9655 0.08154630
## 2  1881     M    John  8769 0.08098299
## 3  1880     M William  9531 0.08049899
## 4  1883     M    John  8894 0.07907324
## 5  1881     M William  8524 0.07872038
## 6  1882     M    John  9557 0.07831617
  1. Describe precisely what the prop variable is telling us here. What does it mean for a name to be “first” in this list?

  2. Find the most popular name in your birth year.

  3. Find the birth year in which your name (or a name you like) was most popular, as measured by the share of babies given that name.

  4. Find the birth year in which the greatest number of babies were born with your name (or name of choice). Explain why it could be different than the year in the last question.

  5. Think about how you might address the following question, without actually trying to do it now: In which year was a particular name (pick any name) the most balanced between males and females; that is, when was the number of male and female births for that name closest to a 50/50 split?

Calculating summary statistics with summarize()

The summarize() verb works a little bit differently than the other four verbs. Whereas filter(), select(), mutate(), and arrange() take in a dataset where the rows are cases and the columns are variables and return a dataset in the same form, summarize() takes a dataset where the rows are cases and the columns are variables and returns a dataset with just one row (at least, when it is used by itself), where the columns are summary statistics (things like means, standard deviations, etc.) calculated from all the cases in the input.

Tip: When using summarize(), it is almost always desirable to return as one of the summary statistics the number of cases in the set being summarized. Among other things, this can be a quick way to alert you to errors. The n() function (called with no arguments) is a special helper function that does this.

NB: The babynames data contains a variable called n. Don’t confuse this variable with the function n(). In fact, to prevent confusion, rename the n variable to num_births

Code

modified_babynames <- modified_babynames %>%
  rename(num_births = n)
head(modified_babynames)
## # A tibble: 6 x 6
##    year   sex      name num_births       prop is_popular
##   <dbl> <chr>     <chr>      <int>      <dbl>      <lgl>
## 1  1880     F      Mary       7065 0.07238433       TRUE
## 2  1880     F      Anna       2604 0.02667923       TRUE
## 3  1880     F      Emma       2003 0.02052170       TRUE
## 4  1880     F Elizabeth       1939 0.01986599       TRUE
## 5  1880     F    Minnie       1746 0.01788861       TRUE
## 6  1880     F  Margaret       1578 0.01616737       TRUE

Suppose we want to find the year in which the name “Bella” hit its peak for females. We can take the Bellas dataset we created above, and use summarize() together with the max() function to answer this question.

Code:

Bellas <- modified_babynames %>%
  filter(name == "Bella" & sex == "F")
Bellas %>%
  summarize(num_rows = n(), most_bellas = max(num_births))
## # A tibble: 1 x 2
##   num_rows most_bellas
##      <int>       <dbl>
## 1      136        5112

Hmm… This tells us how many female Bellas were born in that name’s peak year for females, but it doesn’t actually tell us the year when that happened. We can use the which.max() function to get the relevant row number, and pull out the year in that position using square bracket element selection syntax, as follows.

Bellas %>%
  summarize(
    num_rows    = n(), 
    peak_number = max(num_births),
    peak_year   = year[which.max(num_births)])
## # A tibble: 1 x 3
##   num_rows peak_number peak_year
##      <int>       <dbl>     <dbl>
## 1      136        5112      2010
  1. Explain what the num_rows value tells us in context.

  2. Use summarize() instead of arrange() to redo exercises 3 and 4.

Piping Practice

And now for the cake-decorating portion of the lab. Just kidding.

Recall that when we write

dataset %>% verb(arguments)

this is equivalent to writing

verb(dataset, arguments)

More generally,

some_function(main_argument, other_arguments)

is rewritten as

main_argument %>% some_function(other_arguments)

With just one function it’s not clear that the pipe syntax is any clearer, but when we start chaining operations together, writing the verbs from left to right instead of from inside out (which is how we’d have to do it without the pipe) makes the code a whole lot easier to read.

  1. Re-write the following mess of a command, which displays a list of the top ten years in which male Colins were born in descending order of the number of male babies named Colin that year, using the pipe operator.
head(
  select(
    arrange(
      filter(
        modified_babynames, name == "Colin" & sex == "M"),
      desc(num_births)),
    year, num_births),
  n = 10)
## # A tibble: 10 x 2
##     year num_births
##    <dbl>      <int>
##  1  2004       5122
##  2  2003       4875
##  3  2005       4531
##  4  2006       3857
##  5  2008       3728
##  6  2009       3654
##  7  2007       3608
##  8  2010       3486
##  9  2002       3315
## 10  2011       3265

Possible solution:

modified_babynames %>%
  filter(name == "Colin" & sex == "M") %>%
  arrange(desc(num_births)) %>%
  select(year, num_births) %>%
  head()

Getting Credit

DM me in Slack with the following: * Your code to find the peak birth year for your (chosen) name in two ways: one using arrange() and one using summarize(). * What aspects of the data-wrangling we’ve done so far do you feel most comfortable with after doing this lab? * What aspects do you feel least comfortable with?