Gain practice with the five “fundamental verbs” that are the building blocks in the “grammar of data wrangling”, as implemented in the dplyr package.
The verbs are:
filter()
select()
mutate()
arrange()
summarize()
You will probably want to look at the reference sheet or the slides from time to time. Remember that knowing how to look things up is an important skill! Nobody memorizes everything.
We’ll look some more at the babynames dataset for this lab. Let’s make sure it (as well as the tidyverse package) is loaded in our Markdown document.
Code:
filter()
Recall that filter() allows us to extract a subset of cases from a dataset by checking each case against a criterion.
head(). The code is available on the knitted “partial solutions” version of the lab on the website if you need to refer to it, but see if you can do it without peeking first.
If we specify multiple filter conditions separated by a comma or an ampersand (&), a case will only be included if it satisfies all of them. If we separate conditions with a vertical bar (|), a case will be included if it satisfies any of them. We can build more complex filters by putting parentheses around a conjunction or disjunction of conditions and then combining those parenthesized groups with & or |. Commas cannot be used inside such a group, so an “and” within parentheses must be written with &.
For example, we could return a dataset consisting of the records about babies named “Joseph” or “Josephine” who were recorded as the opposite sex from the traditional association for those names:
Code:
## # A tibble: 210 x 5
## year sex name n prop
## <dbl> <chr> <chr> <int> <dbl>
## 1 1880 F Joseph 10 0.000102
## 2 1881 F Joseph 10 0.000101
## 3 1882 F Joseph 6 0.0000519
## 4 1883 F Joseph 17 0.000142
## 5 1884 F Joseph 9 0.0000654
## 6 1885 F Joseph 14 0.0000986
## 7 1885 M Josephine 6 0.0000518
## 8 1886 F Joseph 8 0.0000520
## 9 1887 F Joseph 13 0.0000836
## 10 1888 F Joseph 18 0.0000950
## # … with 200 more rows
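A filter of this kind can be sketched on a toy data frame. Here `toy_names` is a hypothetical stand-in for babynames with only the columns the conditions need, so the sketch runs without the real data. It is one way to write the condition, not necessarily the code used in the lab.

```r
library(dplyr)

# Hypothetical toy stand-in for babynames (three columns only).
toy_names <- data.frame(
  year = c(1880, 1880, 1880, 1885, 1885),
  sex  = c("M", "F", "F", "M", "F"),
  name = c("Joseph", "Joseph", "Josephine", "Josephine", "Mary")
)

# Comma-separated conditions: a row must satisfy all of them.
female_josephs <- toy_names %>%
  filter(name == "Joseph", sex == "F")

# Two parenthesized "and" groups joined with "or".
nontraditional <- toy_names %>%
  filter((name == "Joseph" & sex == "F") |
         (name == "Josephine" & sex == "M"))

print(nontraditional)
```

Swapping `toy_names` for `babynames` should apply the same condition to the full dataset.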
select()
Recall that select() allows us to extract certain columns from a dataset, by listing each variable name we want to include as a separate argument, by listing each variable name we want to exclude, or by defining a condition for inclusion/exclusion.
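The three ways of choosing columns can be sketched as follows. `toy_names` is a hypothetical stand-in for babynames whose rows are copied from output shown in this lab, and the `where()` helper assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Hypothetical toy stand-in for babynames; rows copied from output in this lab.
toy_names <- data.frame(
  year = c(1880, 1880, 1880),
  sex  = c("F", "F", "F"),
  name = c("Mary", "Anna", "Emma"),
  n    = c(7065L, 2604L, 2003L),
  prop = c(0.0724, 0.0267, 0.0205)
)

# 1. List the columns to keep.
kept <- toy_names %>% select(year, name, n)

# 2. List the columns to drop, each with a minus sign.
dropped <- toy_names %>% select(-sex, -prop)

# 3. Keep the columns that meet a condition, here "is numeric".
numeric_only <- toy_names %>% select(where(is.numeric))

names(numeric_only)
```

The first two calls return the same columns; which form is more convenient depends on whether the keep list or the drop list is shorter.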
babynames dataset (so, not just Bellas) and display the first few rows, retaining only the year, name and n variables (again, it’s worth trying to do this before looking at my code).
Bellas that retains just year and n for the first few years of female Bellas, by chaining filter() and select() together with pipes, and assigning the result. Check that the result looks as it should using head() (but don’t restrict the Bellas dataset to the first few rows).
mutate()
Suppose we want to split the set of name/sex pairs into those that were “popular” in a given year, and those that were not so popular. We will define “popular” for this purpose as being a name that was assigned to at least 1% of all babies of a particular sex (as assigned according to the birth record) that year. The prop variable represents the proportion of births, out of all of those recorded for a given sex, that have the name in question.
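As a minimal sketch of this definition, a logical variable can be created by comparing prop with the 1% threshold. `toy_names` and `toy_with_popular` are hypothetical names, and the rows are copied from output shown elsewhere in this lab.

```r
library(dplyr)

# Hypothetical toy stand-in for babynames; rows copied from output in this lab.
toy_names <- data.frame(
  year = c(1880, 1880, 1880, 1880),
  sex  = c("M", "F", "F", "F"),
  name = c("John", "Mary", "Minnie", "Joseph"),
  prop = c(0.0815, 0.0724, 0.0179, 0.000102)
)

# The comparison yields TRUE or FALSE for each row; 0.01 is the 1% threshold.
toy_with_popular <- toy_names %>%
  mutate(popular = prop >= 0.01)

print(toy_with_popular)
```

The right-hand side of `popular =` is written exactly as it would be inside filter(); mutate() stores the TRUE/FALSE result as a column instead of using it to drop rows.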
prop using mutate(), and store the resulting dataset in a new R object. The definition of the new variable after the = will be in the same form as the condition in a filter() expression. Conditional statements like this return TRUE or FALSE for each case they are evaluated on.
If we decide to change the name a variable is given, we can replace it using the rename() function. For example, let’s rename popular to is_popular. This function has the following syntax:
rename(dataset, new_name = old_name)
or preferably,
dataset %>% rename(new_name = old_name)
babynames_with_popular from the last exercise, use rename() to return a dataset in which popular is instead called is_popular. Store the new dataset in another R object. (We could overwrite the original data as we go, but this can lead to issues when running chunks interactively if we don’t rerun previous chunks every time, because any time we run a chunk it will use the current version of an object.)
PopularBabynames that includes only those names that were “popular” in the given year. Use the new is_popular variable to do the filtering, and then remove the variable from the filtered dataset using select(), since it is now a “constant”.
arrange()
We can easily see at what point the largest share of births (for a given sex) went to a single name by sorting the dataset by prop. We can use arrange() for this. To arrange in descending order so that the most popular name is at the top, use the desc() helper function around the variable name.
Code:
## # A tibble: 6 x 5
## year sex name n prop
## <dbl> <chr> <chr> <int> <dbl>
## 1 1880 M John 9655 0.0815
## 2 1881 M John 8769 0.0810
## 3 1880 M William 9532 0.0805
## 4 1883 M John 8894 0.0791
## 5 1881 M William 8524 0.0787
## 6 1882 M John 9557 0.0783
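Sorting of this kind can be sketched on a few of the rows shown above. `toy_names` is a hypothetical stand-in for babynames, and the third call is an extra illustration of sorting by more than one variable.

```r
library(dplyr)

# Hypothetical toy stand-in for babynames; rows copied from the output above.
toy_names <- data.frame(
  year = c(1881, 1880, 1880),
  sex  = c("M", "M", "M"),
  name = c("William", "William", "John"),
  prop = c(0.0787, 0.0805, 0.0815)
)

# Default order is ascending: smallest prop first.
ascending <- toy_names %>% arrange(prop)

# desc() reverses the order: largest prop first.
descending <- toy_names %>% arrange(desc(prop))

# Additional variables sort within groups of the earlier ones.
by_name_then_prop <- toy_names %>% arrange(name, desc(prop))

print(descending)
```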
prop variable is telling us here. What does it mean for a name to be “first” in this list?
summarize()
The summarize() verb works a little bit differently than the other four verbs. Whereas filter(), select(), mutate(), and arrange() take in a dataset where the rows are cases and the columns are variables and return a dataset in the same form, summarize() takes a dataset where the rows are cases and the columns are variables and returns a dataset with just one row (at least, when it is used by itself), where the columns are summary statistics (things like means, standard deviations, etc.) calculated from all the cases in the input.
Tip: When using summarize(), it is almost always desirable to return as one of the summary statistics the number of cases in the set being summarized. Among other things, this can be a quick way to alert you to errors. The n() function (called with no arguments) is a special helper function that does this.
Note: The babynames data contains a variable called n. Don’t confuse this variable n with the function n(). In fact, to prevent confusion, let’s rename the n variable to num_births.
Code:
## # A tibble: 6 x 5
## year sex name num_births prop
## <dbl> <chr> <chr> <int> <dbl>
## 1 1880 F Mary 7065 0.0724
## 2 1880 F Anna 2604 0.0267
## 3 1880 F Emma 2003 0.0205
## 4 1880 F Elizabeth 1939 0.0199
## 5 1880 F Minnie 1746 0.0179
## 6 1880 F Margaret 1578 0.0162
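With the count column renamed, a summary that includes the number of cases might look like the sketch below. `toy_no_n` is a hypothetical stand-in built from the six rows printed above, and the summary column names are illustrative.

```r
library(dplyr)

# Hypothetical toy stand-in for the renamed dataset; rows copied from the output above.
toy_no_n <- data.frame(
  year       = rep(1880, 6),
  sex        = rep("F", 6),
  name       = c("Mary", "Anna", "Emma", "Elizabeth", "Minnie", "Margaret"),
  num_births = c(7065L, 2604L, 2003L, 1939L, 1746L, 1578L)
)

# Each argument defines one summary column; n() counts the rows being summarized.
toy_summary <- toy_no_n %>%
  summarize(
    num_rows     = n(),
    total_births = sum(num_births),
    max_births   = max(num_births)
  )

print(toy_summary)
```

The result has a single row. If `num_rows` is not what you expect (here, 6), that is a hint that an earlier filtering step did something different from what was intended.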
Suppose we want to find the year in which the name “Bella” hit its peak for females. We could do this with arrange() and head(), or using summarize() together with a summary function that returns the value of one variable for the case when another variable is maximized. However, this is a common enough thing to want to do that there is a dedicated function for it, called slice_max().
It has arguments order_by= and n= to which we give the variable name we want to sort by and the number of cases we want to return.
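As a rough sketch of how these two arguments are used, the example below finds the peak year in a small hypothetical data frame, `toy_colin`, whose rows are copied from output shown elsewhere in this lab. Note that `n = 1` here is the argument of slice_max(), not the n variable from babynames.

```r
library(dplyr)

# Hypothetical toy data; rows copied from output shown in this lab.
toy_colin <- data.frame(
  year       = c(2002, 2003, 2004, 2005),
  num_births = c(3315L, 4876L, 5122L, 4531L)
)

# Keep the single row with the largest num_births.
peak <- toy_colin %>%
  slice_max(order_by = num_births, n = 1)

# The same row obtained the longer way, with arrange() and head().
peak_long <- toy_colin %>%
  arrange(desc(num_births)) %>%
  head(n = 1)

print(peak)
```

By default slice_max() keeps tied rows, so it can return more than `n` rows when several cases share the maximum.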
slice_max() together with other wrangling verbs to produce a dataset that has just a single row and just three columns: name, as well as peak_year and peak_count, which contain the year with the most female Bellas and the number of female Bellas recorded that year, respectively.
And now for the cake-decorating portion of the lab. Just kidding.
Recall that when we write
dataset %>% verb(arguments)
this is equivalent to writing
verb(dataset, arguments)
More generally,
some_function(main_argument, other_arguments)
is rewritten as
main_argument %>% some_function(other_arguments)
With just one function it’s not clear that the pipe syntax is any clearer, but when we start chaining operations together, writing the verbs from left to right instead of from inside out (which is how we’d have to do it without the pipe) makes the code a whole lot easier to read.
head(
  select(
    arrange(
      filter(babynames_no_n, name == "Colin" & sex == "M"),
      desc(num_births)),
    year, num_births),
  n = 10)
## # A tibble: 10 x 2
## year num_births
## <dbl> <int>
## 1 2004 5122
## 2 2003 4876
## 3 2005 4531
## 4 2006 3858
## 5 2008 3728
## 6 2009 3655
## 7 2007 3608
## 8 2010 3486
## 9 2002 3315
## 10 2011 3265
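For comparison, here is a sketch of the same chain written both ways on a small data frame. `toy_no_n` is a hypothetical stand-in for babynames_no_n with rows copied from output in this lab, and `n = 2` is used instead of `n = 10` only because the toy data is short.

```r
library(dplyr)

# Hypothetical toy stand-in for babynames_no_n; rows copied from output in this lab.
toy_no_n <- data.frame(
  year       = c(2003, 2004, 2005, 1880),
  sex        = c("M", "M", "M", "F"),
  name       = c("Colin", "Colin", "Colin", "Mary"),
  num_births = c(4876L, 5122L, 4531L, 7065L)
)

# Nested form: read from the innermost call outward.
nested <- head(
  select(
    arrange(
      filter(toy_no_n, name == "Colin" & sex == "M"),
      desc(num_births)),
    year, num_births),
  n = 2)

# Piped form: read top to bottom, in the order the steps happen.
piped <- toy_no_n %>%
  filter(name == "Colin" & sex == "M") %>%
  arrange(desc(num_births)) %>%
  select(year, num_births) %>%
  head(n = 2)

# Both forms should produce the same data frame.
identical(nested, piped)
```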
mutate() or summarize() can do on their own (at least, not without some ugly hacks).
#lab6 channel identifying the thing you found the most challenging about this lab, as well as (if you want) something you found interesting.