Five Data-Wrangling Verbs

Goal

Gain practice with the five “fundamental verbs” that are the building blocks in the “grammar of data wrangling”, as implemented in the dplyr package.

The verbs are:

  • filter()
  • select()
  • mutate()
  • arrange()
  • summarize()

Resources

You will probably want to look at the reference sheet or the slides from time to time. Remember that knowing how to look things up is an important skill! Nobody memorizes everything.

The Data

We’ll look some more at the babynames dataset for this lab. Let’s make sure it (as well as the tidyverse package) is loaded in our Markdown document.

Code:

Extracting subsets with filter()

Recall that filter() allows us to extract a subset of cases from a dataset, by checking for a particular criterion.

  1. Let’s extract “Bella”s who were recorded as female, and display the first few rows with head(). The code is available on the Knitted “partial solutions” version of the lab on the website if you need to refer to it, but see if you can do it without peeking first.

SOLUTION
## # A tibble: 6 x 5
##    year sex   name      n     prop
##   <dbl> <chr> <chr> <int>    <dbl>
## 1  1880 F     Bella    13 0.000133
## 2  1881 F     Bella    24 0.000243
## 3  1882 F     Bella    16 0.000138
## 4  1883 F     Bella    17 0.000142
## 5  1884 F     Bella    31 0.000225
## 6  1885 F     Bella    25 0.000176

If we specify multiple filter conditions separated by a comma or an ampersand (&), a case will only be included if it satisfies all of them. If we separate conditions with a vertical bar (|), a case will be included if it satisfies any of them. We can make more complex filters by putting parentheses around conjunctions or disjunctions of conditions like this (though not those involving commas – we’d need to use & if we want to do this with an “and” statement), and creating conjunctions or disjunctions of them.

For example, we could return a dataset consisting of the records about babies named “Joseph” or “Josephine” who were recorded as the opposite sex from the traditional association for those names:

Code:

## # A tibble: 210 x 5
##     year sex   name          n      prop
##    <dbl> <chr> <chr>     <int>     <dbl>
##  1  1880 F     Joseph       10 0.000102 
##  2  1881 F     Joseph       10 0.000101 
##  3  1882 F     Joseph        6 0.0000519
##  4  1883 F     Joseph       17 0.000142 
##  5  1884 F     Joseph        9 0.0000654
##  6  1885 F     Joseph       14 0.0000986
##  7  1885 M     Josephine     6 0.0000518
##  8  1886 F     Joseph        8 0.0000520
##  9  1887 F     Joseph       13 0.0000836
## 10  1888 F     Joseph       18 0.0000950
## # … with 200 more rows

Selecting variables with select()

Recall that select() allows us to extract certain columns from a dataset, by listing each variable name we want to include as a separate argument, by listing each variable name we want to exclude, or by defining a condition for inclusion/exclusion.

  1. Start with the full babynames dataset (so, not just Bellas) and display the first few rows, retaining only the year, name and n variables (again, it’s worth trying to do this before looking at my code).

SOLUTION
## # A tibble: 6 x 3
##    year name          n
##   <dbl> <chr>     <int>
## 1  1880 Mary       7065
## 2  1880 Anna       2604
## 3  1880 Emma       2003
## 4  1880 Elizabeth  1939
## 5  1880 Minnie     1746
## 6  1880 Margaret   1578

  1. Now create a new dataset called Bellas that retains just year and n for the first few years of female Bellas, by chaining filter() and select() together with pipes, and assigning the result. Check that the result looks as it should using head() (but don’t restrict the Bellas dataset to the first few rows).

Defining new variables with mutate()

Suppose we want to split the set of name/sex pairs into those that were “popular” in a given year, and those that were not so popular. We will define “popular” for this purpose as being a name that was assigned to at least 1% of all babies of a particular sex (as assigned according to the birth record) that year. The prop variable represents the proportion of births, out of all of those recorded for a given sex, that have the name in question.

  1. Define a new binary variable from prop using mutate(), and store the resulting dataset in an new R object. The definition of the new variable after the = will be in the same form as the condition in a filter() expression. Conditional statements like this return TRUE or FALSE for each case they are evaluated on.

SOLUTION
## # A tibble: 6 x 6
##    year sex   name          n   prop popular
##   <dbl> <chr> <chr>     <int>  <dbl> <lgl>  
## 1  1880 F     Mary       7065 0.0724 TRUE   
## 2  1880 F     Anna       2604 0.0267 TRUE   
## 3  1880 F     Emma       2003 0.0205 TRUE   
## 4  1880 F     Elizabeth  1939 0.0199 TRUE   
## 5  1880 F     Minnie     1746 0.0179 TRUE   
## 6  1880 F     Margaret   1578 0.0162 TRUE

If we decide to change the name a variable is given, we can replace it using the rename() function. For example, let’s rename popular to is_popular. This function has the following syntax:

or preferably,

  1. Using babynames_with_popular from the last exercise, use rename() to return a dataset in which popular is instead called is_popular. Store the new dataset in another R object. (We could be overwriting the original data as we go, but this can lead to issues when running chunks interactively if we don’t rerun previous chunks every time, because any time we run a chunk it will use the current version of an object)

SOLUTION
## # A tibble: 6 x 6
##    year sex   name          n   prop is_popular
##   <dbl> <chr> <chr>     <int>  <dbl> <lgl>     
## 1  1880 F     Mary       7065 0.0724 TRUE      
## 2  1880 F     Anna       2604 0.0267 TRUE      
## 3  1880 F     Emma       2003 0.0205 TRUE      
## 4  1880 F     Elizabeth  1939 0.0199 TRUE      
## 5  1880 F     Minnie     1746 0.0179 TRUE      
## 6  1880 F     Margaret   1578 0.0162 TRUE

  1. Create a dataset called PopularBabynames that includes only those names that were “popular” in the given year. Use the new is_popular variable to do the filtering, and then remove the variable from the filtered dataset using select() since it is now a “constant”.

SOLUTION
## # A tibble: 6 x 5
##    year sex   name          n   prop
##   <dbl> <chr> <chr>     <int>  <dbl>
## 1  1880 F     Mary       7065 0.0724
## 2  1880 F     Anna       2604 0.0267
## 3  1880 F     Emma       2003 0.0205
## 4  1880 F     Elizabeth  1939 0.0199
## 5  1880 F     Minnie     1746 0.0179
## 6  1880 F     Margaret   1578 0.0162

Sorting data with arrange()

We can easily see at what point the largest share of births (for a given sex) went to a single name by sorting the dataset by prop. We can use arrange() for this. To arrange in descending order so that the most popular name is at the top, use the desc() helper function around the variable name.

Code:

## # A tibble: 6 x 5
##    year sex   name        n   prop
##   <dbl> <chr> <chr>   <int>  <dbl>
## 1  1880 M     John     9655 0.0815
## 2  1881 M     John     8769 0.0810
## 3  1880 M     William  9532 0.0805
## 4  1883 M     John     8894 0.0791
## 5  1881 M     William  8524 0.0787
## 6  1882 M     John     9557 0.0783
  1. Describe precisely what the prop variable is telling us here. What does it mean for a name to be “first” in this list?

RESPONSE

The prop variable tells us the proportion of babies with a particular sex assigned at birth that had the name in question. So by sorting on that variable we see the most popular names for each sex (as a share of births for that sex)


  1. Find the most popular names for male-recorded births and for female-recorded births in your birth year.

SOLUTION
## # A tibble: 7,364 x 5
##     year sex   name            n   prop
##    <dbl> <chr> <chr>       <int>  <dbl>
##  1  1982 M     Michael     68228 0.0362
##  2  1982 M     Christopher 59225 0.0314
##  3  1982 M     Matthew     46060 0.0244
##  4  1982 M     Jason       40624 0.0215
##  5  1982 M     David       40451 0.0214
##  6  1982 M     James       38872 0.0206
##  7  1982 M     Joshua      38027 0.0202
##  8  1982 M     John        34699 0.0184
##  9  1982 M     Robert      34421 0.0182
## 10  1982 M     Daniel      32653 0.0173
## # … with 7,354 more rows

  1. Choose a name, sex pair you like (perhaps your own). Find the birth year when that name was most popular for babies recorded as that sex.

POSSIBLE SOLUTION
## # A tibble: 127 x 5
##     year sex   name      n    prop
##    <dbl> <chr> <chr> <int>   <dbl>
##  1  2004 M     Colin  5122 0.00242
##  2  2003 M     Colin  4876 0.00232
##  3  2005 M     Colin  4531 0.00213
##  4  2006 M     Colin  3858 0.00176
##  5  2009 M     Colin  3655 0.00172
##  6  2008 M     Colin  3728 0.00171
##  7  2010 M     Colin  3486 0.00170
##  8  2007 M     Colin  3608 0.00163
##  9  2011 M     Colin  3265 0.00161
## 10  2002 M     Colin  3315 0.00160
## # … with 117 more rows

  1. Find the birth year in which the greatest number of babies were born with your name and sex (or name and sex of choice). Explain why it could be different than the year in the last question.

SOLUTION
## # A tibble: 127 x 5
##     year sex   name      n    prop
##    <dbl> <chr> <chr> <int>   <dbl>
##  1  2004 M     Colin  5122 0.00242
##  2  2003 M     Colin  4876 0.00232
##  3  2005 M     Colin  4531 0.00213
##  4  2006 M     Colin  3858 0.00176
##  5  2008 M     Colin  3728 0.00171
##  6  2009 M     Colin  3655 0.00172
##  7  2007 M     Colin  3608 0.00163
##  8  2010 M     Colin  3486 0.00170
##  9  2002 M     Colin  3315 0.00160
## 10  2011 M     Colin  3265 0.00161
## # … with 117 more rows

In my case it’s the same, but since the number of births is changing over time, it might be that a name is relatively less popular in a year with more births of a particular sex than in a year with fewer births for that sex, and so the proportion is lower, but the absolute number is higher because of the higher total number of births that year.

Calculating summary statistics with summarize()

The summarize() verb works a little bit differently than the other four verbs. Whereas filter(), select(), mutate(), and arrange() take in a dataset where the rows are cases and the columns are variables and return a dataset in the same form, summarize() takes a dataset where the rows are cases and the columns are variables and returns a dataset with just one row (at least, when it is used by itself), where the columns are summary statistics (things like means, standard deviations, etc.) calculated from all the cases in the input.

Tip: When using summarize(), it is almost always desirable to return as one of the summary statistics the number of cases in the set being summarized. Among other things, this can be a quick way to alert you to errors. The n() function (called with no arguments) is a special helper function that does this.

Note: The babynames data contains a variable called n. Don’t confuse this variable n with the function n(). In fact, to prevent confusion, let’s rename the n variable to num_births.

Code

## # A tibble: 6 x 5
##    year sex   name      num_births   prop
##   <dbl> <chr> <chr>          <int>  <dbl>
## 1  1880 F     Mary            7065 0.0724
## 2  1880 F     Anna            2604 0.0267
## 3  1880 F     Emma            2003 0.0205
## 4  1880 F     Elizabeth       1939 0.0199
## 5  1880 F     Minnie          1746 0.0179
## 6  1880 F     Margaret        1578 0.0162

Suppose we want to find the year in which the name “Bella” hit its peak for females. We could do this with arrange() and head(), or using summarize() together with a summary function that returns the value of one variable for the case when another variable is maximized. However, this is a common enough thing to want to do that there is a dedicated function for it, called slice_max().

It has arguments order_by= and n= to which we give the variable name we want to sort by and the number of cases we want to return.

  1. Use slice_max() together with other wrangling verbs to produce a dataset that has just a single row and just two three columns: name, as well as peak_year and peak_count, which contain the year with the most female Bellas and the number of female Bellas recorded that year, respectively.

SOLUTION

Piping Practice

And now for the cake-decorating portion of the lab. Just kidding.

Recall that when we write

dataset %>% verb(arguments)

this is equivalent to writing

verb(dataset, arguments)

More generally,

some_function(main_argument, other_arguments)

is rewritten as

main_argument %>% some_function(other_arguments)

With just one function it’s not clear that the pipe syntax is any clearer, but when we start chaining operations together, writing the verbs from left to right instead of from inside out (which is how we’d have to do it without the pipe) makes the code a whole lot easier to read.

  1. Re-write the following mess of a command, which displays a list of the top ten years in which male Colins were born in descending order of the number of male babies named Colin that year, using the pipe operator.
## # A tibble: 10 x 2
##     year num_births
##    <dbl>      <int>
##  1  2004       5122
##  2  2003       4876
##  3  2005       4531
##  4  2006       3858
##  5  2008       3728
##  6  2009       3655
##  7  2007       3608
##  8  2010       3486
##  9  2002       3315
## 10  2011       3265

SOLUTION

  1. Think about how you might address the following question: In which year was a particular name (pick any name) the most balanced between males and females; that is, when was the number of male and female births for that name closest to a 50/50 split? You don’t need to write any code, just describe step-by-step what you would need to do. Explain why, given the way this dataset is structured at least, this is not a simple matter of filtering and sorting the data by an existing variable, nor is it something that either mutate() or summarize() can do on their own (at least, not without some ugly hacks).

GENERAL APPROACH (not code, just a procedure)


  1. Post a comment in the #lab6 channel identifying the thing you found the most challenging about this lab, as well as (if you want) something you found interesting.