STAT 209: Lab 6

Five Data-Wrangling Verbs

Goal

Gain practice with the five “fundamental verbs” that are the building blocks in the “grammar of data wrangling”, as implemented in the dplyr package.

The verbs are:

filter()
select()
mutate()
arrange()
summarize()

Resources

You will probably want to look at the reference sheet or the slides from time to time. Remember that knowing how to look things up is an important skill! Nobody memorizes everything.

The Data

We’ll look some more at the babynames dataset for this lab. Let’s make sure it (as well as the tidyverse package) is loaded in our Markdown document.

Code:

library(tidyverse)
library(babynames)
# This is a slightly different dataset than on the last lab
data(babynames) 
babynames %>% 
  head(n = 5)

Extracting subsets with `filter()`

Recall that filter() allows us to extract a subset of cases from a dataset, by checking for a particular criterion.

Let’s extract “Bella”s who were recorded as female, and display the first few rows with head(). The code is available on the Knitted “partial solutions” version of the lab on the website if you need to refer to it, but see if you can do it without peeking first.

SOLUTION

If we specify multiple filter conditions separated by a comma or an ampersand (&), a case will only be included if it satisfies all of them. If we separate conditions with a vertical bar (|), a case will be included if it satisfies any of them. We can make more complex filters by putting parentheses around conjunctions or disjunctions of conditions like this (though not those involving commas – we’d need to use & if we want to do this with an “and” statement), and creating conjunctions or disjunctions of them.

For example, we could return a dataset consisting of the records about babies named “Joseph” or “Josephine” who were recorded as the opposite sex from the traditional association for those names:

Code:

babynames %>% 
  filter(
    (name == "Joseph" & sex == "F") | (name == "Josephine" & sex == "M")
  )

## # A tibble: 210 x 5
##     year sex   name          n      prop
##    <dbl> <chr> <chr>     <int>     <dbl>
##  1  1880 F     Joseph       10 0.000102 
##  2  1881 F     Joseph       10 0.000101 
##  3  1882 F     Joseph        6 0.0000519
##  4  1883 F     Joseph       17 0.000142 
##  5  1884 F     Joseph        9 0.0000654
##  6  1885 F     Joseph       14 0.0000986
##  7  1885 M     Josephine     6 0.0000518
##  8  1886 F     Joseph        8 0.0000520
##  9  1887 F     Joseph       13 0.0000836
## 10  1888 F     Joseph       18 0.0000950
## # … with 200 more rows

Selecting variables with `select()`

Recall that select() allows us to extract certain columns from a dataset, by listing each variable name we want to include as a separate argument, by listing each variable name we want to exclude, or by defining a condition for inclusion/exclusion.

Start with the full babynames dataset (so, not just Bellas) and display the first few rows, retaining only the year, name and n variables (again, it’s worth trying to do this before looking at my code).

SOLUTION

Now create a new dataset called Bellas that retains just year and n for the first few years of female Bellas, by chaining filter() and select() together with pipes, and assigning the result. Check that the result looks as it should using head() (but don’t restrict the Bellas dataset to the first few rows).

SOLUTION

Defining new variables with `mutate()`

Suppose we want to split the set of name/sex pairs into those that were “popular” in a given year, and those that were not so popular. We will define “popular” for this purpose as being a name that was assigned to at least 1% of all babies of a particular sex (as assigned according to the birth record) that year. The prop variable represents the proportion of births, out of all of those recorded for a given sex, that have the name in question.

Define a new binary variable from prop using mutate(), and store the resulting dataset in an new R object. The definition of the new variable after the = will be in the same form as the condition in a filter() expression. Conditional statements like this return TRUE or FALSE for each case they are evaluated on.

SOLUTION

If we decide to change the name a variable is given, we can replace it using the rename() function. For example, let’s rename popular to is_popular. This function has the following syntax:

rename(dataset, newname1 = oldname1, newname2 = oldname2, ...)

or preferably,

dataset %>% 
    rename(newname1 = oldname1, newname2 = oldname2, ...)

Using babynames_with_popular from the last exercise, use rename() to return a dataset in which popular is instead called is_popular. Store the new dataset in another R object. (We could be overwriting the original data as we go, but this can lead to issues when running chunks interactively if we don’t rerun previous chunks every time, because any time we run a chunk it will use the current version of an object)

SOLUTION

Create a dataset called PopularBabynames that includes only those names that were “popular” in the given year. Use the new is_popular variable to do the filtering, and then remove the variable from the filtered dataset using select() since it is now a “constant”.

SOLUTION

Sorting data with `arrange()`

We can easily see at what point the largest share of births (for a given sex) went to a single name by sorting the dataset by prop. We can use arrange() for this. To arrange in descending order so that the most popular name is at the top, use the desc() helper function around the variable name.

Code:

babynames %>%
  arrange(desc(prop)) %>%
  head()

## # A tibble: 6 x 5
##    year sex   name        n   prop
##   <dbl> <chr> <chr>   <int>  <dbl>
## 1  1880 M     John     9655 0.0815
## 2  1881 M     John     8769 0.0810
## 3  1880 M     William  9532 0.0805
## 4  1883 M     John     8894 0.0791
## 5  1881 M     William  8524 0.0787
## 6  1882 M     John     9557 0.0783

Describe precisely what the prop variable is telling us here. What does it mean for a name to be “first” in this list?

RESPONSE

Find the most popular names for male-recorded births and for female-recorded births in your birth year.

SOLUTION

Choose a name, sex pair you like (perhaps your own). Find the birth year when that name was most popular for babies recorded as that sex.

POSSIBLE SOLUTION

Find the birth year in which the greatest number of babies were born with your name and sex (or name and sex of choice). Explain why it could be different than the year in the last question.

SOLUTION

Calculating summary statistics with `summarize()`

The summarize() verb works a little bit differently than the other four verbs. Whereas filter(), select(), mutate(), and arrange() take in a dataset where the rows are cases and the columns are variables and return a dataset in the same form, summarize() takes a dataset where the rows are cases and the columns are variables and returns a dataset with just one row (at least, when it is used by itself), where the columns are summary statistics (things like means, standard deviations, etc.) calculated from all the cases in the input.

Tip: When using summarize(), it is almost always desirable to return as one of the summary statistics the number of cases in the set being summarized. Among other things, this can be a quick way to alert you to errors. The n() function (called with no arguments) is a special helper function that does this.

Note: The babynames data contains a variable called n. Don’t confuse this variable n with the function n(). In fact, to prevent confusion, let’s rename the n variable to num_births.

Code

babynames_no_n <- babynames %>%
  rename(num_births = n)
head(babynames_no_n)

## # A tibble: 6 x 5
##    year sex   name      num_births   prop
##   <dbl> <chr> <chr>          <int>  <dbl>
## 1  1880 F     Mary            7065 0.0724
## 2  1880 F     Anna            2604 0.0267
## 3  1880 F     Emma            2003 0.0205
## 4  1880 F     Elizabeth       1939 0.0199
## 5  1880 F     Minnie          1746 0.0179
## 6  1880 F     Margaret        1578 0.0162

Suppose we want to find the year in which the name “Bella” hit its peak for females. We could do this with arrange() and head(), or using summarize() together with a summary function that returns the value of one variable for the case when another variable is maximized. However, this is a common enough thing to want to do that there is a dedicated function for it, called slice_max().

It has arguments order_by= and n= to which we give the variable name we want to sort by and the number of cases we want to return.

Use slice_max() together with other wrangling verbs to produce a dataset that has just a single row and just two three columns: name, as well as peak_year and peak_count, which contain the year with the most female Bellas and the number of female Bellas recorded that year, respectively.

SOLUTION

Piping Practice

And now for the cake-decorating portion of the lab. Just kidding.

Recall that when we write

dataset %>% verb(arguments)

this is equivalent to writing

verb(dataset, arguments)

More generally,

some_function(main_argument, other_arguments)

is rewritten as

main_argument %>% some_function(other_arguments)

With just one function it’s not clear that the pipe syntax is any clearer, but when we start chaining operations together, writing the verbs from left to right instead of from inside out (which is how we’d have to do it without the pipe) makes the code a whole lot easier to read.

Re-write the following mess of a command, which displays a list of the top ten years in which male Colins were born in descending order of the number of male babies named Colin that year, using the pipe operator.

head(
  select(
    arrange(
      filter(
        babynames_no_n, name == "Colin" & sex == "M"),
      desc(num_births)),
    year, num_births),
  n = 10)

## # A tibble: 10 x 2
##     year num_births
##    <dbl>      <int>
##  1  2004       5122
##  2  2003       4876
##  3  2005       4531
##  4  2006       3858
##  5  2008       3728
##  6  2009       3655
##  7  2007       3608
##  8  2010       3486
##  9  2002       3315
## 10  2011       3265

SOLUTION

Think about how you might address the following question: In which year was a particular name (pick any name) the most balanced between males and females; that is, when was the number of male and female births for that name closest to a 50/50 split? You don’t need to write any code, just describe step-by-step what you would need to do. Explain why, given the way this dataset is structured at least, this is not a simple matter of filtering and sorting the data by an existing variable, nor is it something that either mutate() or summarize() can do on their own (at least, not without some ugly hacks).

GENERAL APPROACH (not code, just a procedure)

Post a comment in the #lab6 channel identifying the thing you found the most challenging about this lab, as well as (if you want) something you found interesting.

STAT 209: Lab 6

Colin Dawson

June 24, 2021

Five Data-Wrangling Verbs

Goal

Resources

The Data

Extracting subsets with `filter()`

SOLUTION

Selecting variables with `select()`

SOLUTION

SOLUTION

Defining new variables with `mutate()`

SOLUTION

SOLUTION

SOLUTION

Sorting data with `arrange()`

RESPONSE

SOLUTION

POSSIBLE SOLUTION

SOLUTION

Calculating summary statistics with `summarize()`

SOLUTION

Piping Practice

SOLUTION

GENERAL APPROACH (not code, just a procedure)

STAT 209: Lab 6

Colin Dawson

June 24, 2021

Five Data-Wrangling Verbs

Goal

Resources

The Data

Extracting subsets with filter()

SOLUTION

Selecting variables with select()

SOLUTION

SOLUTION

Defining new variables with mutate()

SOLUTION

SOLUTION

SOLUTION

Sorting data with arrange()

RESPONSE

SOLUTION

POSSIBLE SOLUTION

SOLUTION

Calculating summary statistics with summarize()

SOLUTION

Piping Practice

SOLUTION

GENERAL APPROACH (not code, just a procedure)

Extracting subsets with `filter()`

Selecting variables with `select()`

Defining new variables with `mutate()`

Sorting data with `arrange()`

Calculating summary statistics with `summarize()`