Learn to identify repetition in code that could be made more concise by writing a function or an iterative construct, and learn to write such things in R.
One of these days, we’ll work with some different data, I promise.
Load the packages and data:
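A minimal setup, assuming the tidyverse packages are what power the pipelines below:

Code:
library(tidyverse)   # dplyr verbs, the %>% pipe, tibble helpers (assumed here)
library(babynames)   # provides the babynames dataset used throughout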
One of the questions we have been interested in when working with the baby names data is: "In what year did a given name reach the peak of its popularity, measured as a percentage of all births that year?"
For the name Colin, for example, we can answer this question with the following pipeline:
Code:
total_births_by_year <- babynames %>%
  group_by(year) %>%
  summarize(total_births = sum(n))

babynames %>%
  filter(name == "Colin") %>%
  group_by(year) %>%
  summarize(total_for_name = sum(n)) %>%
  left_join(total_births_by_year) %>%
  mutate(percentage_for_year = 100 * total_for_name / total_births) %>%
  slice_max(order_by = percentage_for_year, n = 1) %>%
  select(year, percentage_for_year)
## # A tibble: 1 x 2
## year percentage_for_year
## <dbl> <dbl>
## 1 2004 0.135
So, the name "Colin" has never been a more popular choice for new babies than it was in 2004, when it was given to about 0.135% of all babies, regardless of sex.
If I want to get the same result for a different name, say "Fred", I could just copy and paste the above code and change the name. But copying and pasting is tedious and error-prone, especially if I want to do this for many names.
Instead, I can write a function that captures the “template” for this calculation, and lets me instantiate that template with whatever specific input I want.
What are the inputs to this function? If I always want to return the single peak year, there’s just one input: the name. So I can write:
Code:
peak_year_for_name <- function(name_of_interest)
{
  total_births_by_year <- babynames %>%
    group_by(year) %>%
    summarize(total_births = sum(n))
  babynames %>%
    filter(name == name_of_interest) %>%
    group_by(year) %>%
    summarize(total_for_name = sum(n)) %>%
    left_join(total_births_by_year) %>%
    mutate(percentage_for_year = 100 * total_for_name / total_births) %>%
    slice_max(order_by = percentage_for_year, n = 1) %>%
    select(year, percentage_for_year)
}
The body of the function is exactly the code we wrote before, but instead of hardcoding "Colin", the name in the filter is replaced by the name_of_interest argument (whatever its value ends up being when the function is called).
Now I can just run this function, plugging in whatever name I want, and I quickly get results. Here are some results for various members of my own family:
Code:
peak_year_for_name("Colin")
## # A tibble: 1 x 2
## year percentage_for_year
## <dbl> <dbl>
## 1 2004 0.135
peak_year_for_name("Megan")
## # A tibble: 1 x 2
## year percentage_for_year
## <dbl> <dbl>
## 1 1985 0.563
peak_year_for_name("Bruce")
## # A tibble: 1 x 2
## year percentage_for_year
## <dbl> <dbl>
## 1 1951 0.383
peak_year_for_name("Mary")
## # A tibble: 1 x 2
## year percentage_for_year
## <dbl> <dbl>
## 1 1891 4.10
peak_year_for_name("Arlo")
## # A tibble: 1 x 2
## year percentage_for_year
## <dbl> <dbl>
## 1 2017 0.0317
peak_year_for_name("Esai")
## # A tibble: 1 x 2
## year percentage_for_year
## <dbl> <dbl>
## 1 2002 0.00289
Most functions are designed to work with certain kinds of inputs. For example, name_of_interest in the above should be a quoted text string, not a number, not a data frame, etc.
In some languages, when you write a function, you must explicitly declare what kind of input it accepts. In R, you don't do that; R is what's called a "dynamically typed" language, in which functions will accept whatever input you give them. If what the function does happens to work for that input (even if it's not something the author envisioned), it will go ahead and do it; otherwise you'll get an error somewhere in the middle of the function's execution.
As I'm sure you've seen, this flexibility can sometimes make it difficult to track down what is causing an error in R. It is therefore worth heading this off by including some documentation at the top of your function indicating what type of input you intend the function to be used with. The user of the function is free to violate that intention, but at least they go in with their eyes open.
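Here is a small sketch of that advice in action, using a made-up function (double_it is not part of our baby names analysis, and the stopifnot() check is an optional convention, not something R requires):

Code:
double_it <- function(x)
{
  # Input: x, a numeric vector
  # Returns: a numeric vector of the same length, with each element doubled
  stopifnot(is.numeric(x))  # fail early, with a clear message, on the wrong input type
  2 * x
}
double_it(1:3)
## [1] 2 4 6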
In R, you can type formals(my_function_name) to see at least the names of the arguments to a function. For example:
Code:
formals(peak_year_for_name)
## $name_of_interest
From this we see that peak_year_for_name takes one argument, called name_of_interest.
For functions that are part of an R package, documentation is viewable with the ?function_name syntax (or, equivalently, with help(function_name)). This of course won't exist for custom functions you've written.
The "value" of a function (the thing it returns; for example, the thing that gets stored if you assign the function's result to a variable) is, by default, the value of the last expression evaluated in the function body. In our function there is only one expression (a pipeline built from several component steps), and so the return value is the result of that pipeline.
If we wanted to be more explicit, we could assign the result of the pipeline to a variable (we might call it result), and add the line return(result) at the end of our function.
It's a good idea to do this if your function contains more than one line, to make it clear which part of the function body is the return value. For one-liners (and maybe some very simple multi-liners), it's a judgment call as to whether doing so makes things clearer.
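For our function, that more explicit version might look like this (the body is otherwise unchanged; the pipeline's result is assigned to result and then returned):

Code:
peak_year_for_name <- function(name_of_interest)
{
  total_births_by_year <- babynames %>%
    group_by(year) %>%
    summarize(total_births = sum(n))
  result <- babynames %>%
    filter(name == name_of_interest) %>%
    group_by(year) %>%
    summarize(total_for_name = sum(n)) %>%
    left_join(total_births_by_year) %>%
    mutate(percentage_for_year = 100 * total_for_name / total_births) %>%
    slice_max(order_by = percentage_for_year, n = 1) %>%
    select(year, percentage_for_year)
  return(result)
}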
In “statically typed” languages (the kind where you specify the types that need to be passed in for each argument, like C++, Java, etc), part of the signature of a function is the type of thing that it returns. In dynamically typed languages (like R, Python, etc), the type of the return value could well depend on the types of the arguments provided. But, again, it is a good idea to document the intended return type.
Oftentimes, we want our functions to be flexible, allowing the user to alter several aspects of what they do. We make our functions more flexible by adding more inputs, each of which constitutes a "degree of freedom" for the function. But if most use cases involve sensible defaults, it is cumbersome to force the user to type out those defaults every time they call the function.
We can have the “best of both worlds” (flexibility without cumbersome function calls) by using default argument values.
For example, I could make my peak_year_for_name function more flexible by having it return the n most popular years:
Code:
peak_years_for_name <- function(name_of_interest, n_years)
{
  total_births_by_year <- babynames %>%
    group_by(year) %>%
    summarize(total_births = sum(n))
  babynames %>%
    filter(name == name_of_interest) %>%
    group_by(year) %>%
    summarize(total_for_name = sum(n)) %>%
    left_join(total_births_by_year) %>%
    mutate(percentage_for_year = 100 * total_for_name / total_births) %>%
    slice_max(order_by = percentage_for_year, n = n_years) %>%
    select(year, percentage_for_year)
}
As written, this function now requires the user to specify a number of years. The following will produce an error, since I haven't supplied the n_years argument.
Code:
peak_years_for_name("Colin")
If we think that most often the user will just want to see the single most popular year, I can give that second argument a default value that makes the above work as before.
Function (Re-)definition:
peak_years_for_name <- function(name_of_interest, n_years = 1)
{
  total_births_by_year <- babynames %>%
    group_by(year) %>%
    summarize(total_births = sum(n))
  babynames %>%
    filter(name == name_of_interest) %>%
    group_by(year) %>%
    summarize(total_for_name = sum(n)) %>%
    left_join(total_births_by_year) %>%
    mutate(percentage_for_year = 100 * total_for_name / total_births) %>%
    slice_max(order_by = percentage_for_year, n = n_years) %>%
    select(year, percentage_for_year)
}
Some function calls:
peak_years_for_name("Colin")
## # A tibble: 1 x 2
## year percentage_for_year
## <dbl> <dbl>
## 1 2004 0.135
peak_years_for_name("Colin", n_years = 5)
## # A tibble: 5 x 2
## year percentage_for_year
## <dbl> <dbl>
## 1 2004 0.135
## 2 2003 0.129
## 3 2005 0.118
## 4 2006 0.0977
## 5 2009 0.0958
peak_years_for_name("Colin", 5)  # equivalent call: n_years matched by position
## # A tibble: 5 x 2
## year percentage_for_year
## <dbl> <dbl>
## 1 2004 0.135
## 2 2003 0.129
## 3 2005 0.118
## 4 2006 0.0977
## 5 2009 0.0958
You might have noticed that in our function we hardcoded the dataset to be babynames. If we had tried to call this function without having run library(babynames) above, we'd get an error, since babynames would not then be defined. Similarly, if you "unload" the babynames package (undoing the effect of library()) and then try to call the function, R will complain that the babynames data doesn't exist.
Code:
rm(babynames) # remove the dataset from the environment
detach("package:babynames", unload = TRUE) # unload the package
peak_years_for_name("Colin") ## ERROR: babynames can no longer be found
Someone calling this function would have no easy way of knowing why this happened, since their function call didn’t refer to that dataset; for that reason among others, it’s not great coding practice to hardcode things inside a function like that.
(Let's make sure to bring back the babynames library for later.)
Code:
library(babynames)
How does R know where to look for definitions of things that are referenced in a function? A complete answer would involve a lot of caveats, but for the most part, R will first look inside the function for a definition (at its arguments, and at anything that is created within the function itself), and if it doesn’t find anything, it will look in the “global” environment (that is, at stuff that was defined or loaded into the environment by previous assignments or calls).
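As a small standalone illustration of that lookup order (the names here are made up and are not part of the baby names example):

Code:
x <- 10                  # defined in the global environment
add_x <- function(y)
{
  y + x                  # x is not an argument and is not created inside the
}                        # function, so R finds it in the global environment
add_x(5)
## [1] 15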
In theory we could have added a dataset argument to our function, so that it didn't depend on something defined in the global environment:
Code:
peak_year_for_name <- function(dataset, name_of_interest, n_years = 1)
{
  total_births_by_year <- dataset %>%
    group_by(year) %>%
    summarize(total_births = sum(n))
  dataset %>%
    filter(name == name_of_interest) %>%
    group_by(year) %>%
    summarize(total_for_name = sum(n)) %>%
    left_join(total_births_by_year) %>%
    mutate(percentage_for_year = 100 * total_for_name / total_births) %>%
    slice_max(order_by = percentage_for_year, n = n_years) %>%
    select(year, percentage_for_year)
}
Notice that we still have hardcoded variable names here, so this function will only work if the dataset we provide has the right columns (year, name, and n). But this can be useful if we are going to work with, say, different subsets of a dataset that we obtain by filter()ing:
## # A tibble: 1 x 2
## year percentage_for_year
## <dbl> <dbl>
## 1 1921 3.18
The following exercises involve writing functions designed to tell us things about flights, using the nycflights13 package. Load it first:
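Code:
library(nycflights13)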
Recall that this package provides the dataset flights about individual flights, and the datasets airports and planes about… those things (as well as a couple of others).
1. Write a function called top_n_destinations that takes a dataset argument, a carrierID argument, an origin_airport argument, and an n_destinations argument, and retrieves the n_destinations most common airport destinations (dests) of that carrier's flights taking off from the airport whose code is provided in the origin_airport argument, along with how often the carrier flew there.
2. Use your function to find the most common destinations of Delta (DL) flights from JFK (one of the three airports in New York City), using the flights dataset.
3. Do the same for American (AA) flights from JFK.
4. Write a function that, given a dataset, an origin_code, and a destination_code (e.g. JFK to LAX), will retrieve the n_carriers carriers with the most flights from the origin to the destination, along with the average arrival delay time for those flights.

Computers are excellent at repetition, as long as you tell them precisely what to repeat.
Remember the example above where I called my function on a bunch of names of people in my family? I can make that even more efficient by creating the list of names I’m interested in up front, and then telling the computer “Call this function on each one of these names, and return the results”.
In R, the lapply() function (short for "list apply") is useful for this sort of thing, provided the values I want to vary go with the first argument of my function that I haven't already supplied by name.
Code:
my_name_list <- c("Colin", "Megan", "Bruce", "Mary", "Arlo", "Esai")
lapply(my_name_list, FUN = peak_year_for_name, dataset = babynames)
## [[1]]
## # A tibble: 1 x 2
## year percentage_for_year
## <dbl> <dbl>
## 1 2004 0.135
##
## [[2]]
## # A tibble: 1 x 2
## year percentage_for_year
## <dbl> <dbl>
## 1 1985 0.563
##
## [[3]]
## # A tibble: 1 x 2
## year percentage_for_year
## <dbl> <dbl>
## 1 1951 0.383
##
## [[4]]
## # A tibble: 1 x 2
## year percentage_for_year
## <dbl> <dbl>
## 1 1891 4.10
##
## [[5]]
## # A tibble: 1 x 2
## year percentage_for_year
## <dbl> <dbl>
## 1 2017 0.0317
##
## [[6]]
## # A tibble: 1 x 2
## year percentage_for_year
## <dbl> <dbl>
## 1 2002 0.00289
Notice that the argument I wanted to vary from one call to the next went in the X position for lapply(), whereas the argument(s) that stayed constant were provided to lapply() using their names. Now, this result is a bit inelegant: each call returns a one-row data frame, so what we get back is a list of six tiny data frames. Wouldn't it be nice if we could "stack" these into a single data frame?
We can! The bind_rows() function will do this for us. Examine the results of the following code after each step, to make sure you understand what's happening.
Code:
my_name_list %>%
  lapply(FUN = peak_year_for_name, dataset = babynames) %>%
  bind_rows() %>%
  mutate(name = my_name_list) %>%
  select(name, year, percentage_for_year)
## # A tibble: 6 x 3
## name year percentage_for_year
## <chr> <dbl> <dbl>
## 1 Colin 2004 0.135
## 2 Megan 1985 0.563
## 3 Bruce 1951 0.383
## 4 Mary 1891 4.10
## 5 Arlo 2017 0.0317
## 6 Esai 2002 0.00289
(This time I passed lapply its first (X) argument via a pipe, but it would have been equivalent to put my_name_list first inside the parentheses instead.)
If you have programmed in another language before, you likely would have handled something like this using a "loop", such as a for loop. You can write for loops in R, but it is more "idiomatic" to use the above sort of "apply" construct; and in certain cases it's more efficient too (which is important when there are a lot of iterations involved).
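For comparison, here is one way the same computation could be written with a for loop (a sketch; the lapply() version above is the more idiomatic choice):

Code:
results <- list()
for (i in seq_along(my_name_list)) {
  # call the function once per name, storing each one-row result in a list
  results[[i]] <- peak_year_for_name(
    dataset = babynames,
    name_of_interest = my_name_list[i])
}
bind_rows(results) %>%
  mutate(name = my_name_list) %>%
  select(name, year, percentage_for_year)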
If you find yourself wanting a for loop, ask yourself whether you could handle what you want to do with a function whose first argument is the thing you want to iterate over.
1. Use lapply() and the function that you wrote in Exercise 1 to find the top airport destination for Delta (DL), American (AA), and United (UA).
2. Use lapply() and the function that you wrote in Exercise 4 to find the carriers with the most flights from JFK to Chicago O'Hare (ORD), Los Angeles International (LAX), and San Francisco International (SFO) airports, respectively. In order to have the destination shown in the output, you will need to do a couple of things. First, the vector of destinations that you pass to lapply() should have named entries (it should be in the form c(name1 = value1, name2 = value2, ...), where name1, name2, etc. are the labels you want displayed in the output, and value1, value2, etc. are the actual argument values you're passing to your function). Second, set .id = "destination" as an argument to bind_rows() so that the output includes a column called destination that contains name1, name2, etc.

The following function computes the n_returned most popular names in the dataset passed to it via the dataset argument:
Code:
top_n_names <- function(dataset, n_returned)
{
  # Total births across the whole dataset, used to compute percentages
  overall_total <- dataset %>%
    summarize(total_births = sum(n)) %>%
    pull(total_births)
  dataset %>%
    group_by(name) %>%
    summarize(
      total_for_name = sum(n),
      percent_for_name = total_for_name / overall_total * 100) %>%
    slice_max(
      order_by = total_for_name,
      n = n_returned) %>%
    # rownames_to_column() adds the row numbers (1, 2, 3, ...) as a character "rank" column
    rownames_to_column(var = "rank")
}
Here we use it to find the top 10 names for babies born in 2000.
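One way to make that call, filtering babynames down to the year 2000 and asking for 10 names:

Code:
babynames %>%
  filter(year == 2000) %>%
  top_n_names(n_returned = 10)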
## # A tibble: 10 x 4
## rank name total_for_name percent_for_name
## <chr> <chr> <int> <dbl>
## 1 1 Jacob 34530 0.914
## 2 2 Michael 32149 0.851
## 3 3 Matthew 28616 0.757
## 4 4 Joshua 27592 0.730
## 5 5 Emily 25983 0.688
## 6 6 Christopher 24981 0.661
## 7 7 Nicholas 24691 0.654
## 8 8 Andrew 23684 0.627
## 9 9 Hannah 23105 0.612
## 10 10 Joseph 22849 0.605
If we want to apply this function to find the most popular names in a particular decade, we could simply filter our data to keep only years in the range of interest, and call the function on the filtered data.
But suppose we want to do this for every decade in the 20th century. We could theoretically create 10 datasets, put them in a list, and use lapply on the list of datasets. But it's simpler to take advantage of the do() function for this. This is seen most easily by example:
Code:
## The floor() function rounds down to the nearest integer
top_by_decade <- babynames %>%
  mutate(decade = 10 * floor(year / 10)) %>%
  group_by(decade) %>%
  do(
    top_n_names(
      dataset = .,
      n_returned = 10))
## Inside do(), the period is a placeholder for the data frame belonging to each group (here, each decade)
top_by_decade
## # A tibble: 140 x 5
## # Groups: decade [14]
## decade rank name total_for_name percent_for_name
## <dbl> <chr> <chr> <int> <dbl>
## 1 1880 1 Mary 92030 3.82
## 2 1880 2 John 90395 3.75
## 3 1880 3 William 85246 3.54
## 4 1880 4 James 54323 2.26
## 5 1880 5 George 47980 1.99
## 6 1880 6 Charles 46879 1.95
## 7 1880 7 Anna 38320 1.59
## 8 1880 8 Frank 31135 1.29
## 9 1880 9 Joseph 26404 1.10
## 10 1880 10 Emma 25512 1.06
## # … with 130 more rows
Note that since top_n_names() returns a data frame with n_returned rows (for whatever value of n_returned we supply when we call the function), the result of this operation is a big "stacked" data frame with n_returned names per decade.
Note: If you have worked with the mosaic package, you likely used another function called do(). It's related to the dplyr one, but not identical, so if you are working in an R session with both packages loaded, it's a good idea to be explicit about which one you want to be using. You can do this by writing either dplyr::do() or mosaic::do().
Use do() with your top_n_destinations function from Exercise 1 to find the top destination for each airline's flights from JFK in each month of 2013. Since carrierID is an input to top_n_destinations, you'll probably want to pull the carrier column from each grouped dataset using pull(., carrier) in your call to top_n_destinations.