Making Your Code More Modular

Goal

Learn to identify repetition in code that could be made more concise by writing a function or an iterative construct, and learn to write such things in R.

The Data

One of these days, we’ll work with some different data, I promise.

Load the packages and data:

library(tidyverse)
library(babynames)
data(babynames)    ## SSA data

A repetitive task

One of the questions we have been interested in when working with the baby names data is: “In what year did the name reach its peak in popularity?”

For the name Colin, for example, we can answer this question with the following pipeline:

Code:

total_births_by_year <- babynames %>%
  group_by(year) %>%
  summarize(total_births = sum(n))
babynames %>%
  filter(name == "Colin") %>%
  group_by(year) %>%
  summarize(total_for_name = sum(n)) %>%
  left_join(total_births_by_year) %>%
  mutate(percentage_for_year = 100 * total_for_name / total_births) %>%
  slice_max(order_by = percentage_for_year, n = 1) %>%
  select(year, percentage_for_year)

## # A tibble: 1 x 2
##    year percentage_for_year
##   <dbl>               <dbl>
## 1  2004               0.135

So, the name “Colin” has never been a more popular choice for new babies than it was in 2004, being given to (approximately) 0.13% of all babies, regardless of sex.

Writing a function

If I want to get the same result for a different name, say “Fred”, I could just copy and paste the above code and change the name. But:

This is annoying.
This makes my code harder to read.
If I want to change something (for instance, I decide I don’t want to combine sexes after all), I have to go through and change it in every place.

Instead, I can write a function that captures the “template” for this calculation, and lets me instantiate that template with whatever specific input I want.

What are the inputs to this function? If I always want to return the single peak year, there’s just one input: the name. So I can write:

Code:

peak_year_for_name <- function(name_of_interest) 
{
  total_births_by_year <- babynames %>%
    group_by(year) %>%
    summarize(total_births = sum(n))
  babynames %>%
    filter(name == name_of_interest) %>%
    group_by(year) %>%
    summarize(total_for_name = sum(n)) %>%
    left_join(total_births_by_year) %>%
    mutate(percentage_for_year = 100 * total_for_name / total_births) %>%
    slice_max(order_by = percentage_for_year, n = 1) %>%
    select(year, percentage_for_year)  
}

The body of the function is exactly the code we wrote before, but instead of hardcoding “Colin”, the name in the filter is replaced by the name_of_interest argument (whatever its value ends up being when the function is called).

Now I can just run this function, plugging in whatever name I want, and I quickly get results. Here are some results for various members of my own family:

Code:

peak_year_for_name("Colin")  # me             (born 1980s)

## # A tibble: 1 x 2
##    year percentage_for_year
##   <dbl>               <dbl>
## 1  2004               0.135

peak_year_for_name("Megan")  # my sister      (born 1980s)

## # A tibble: 1 x 2
##    year percentage_for_year
##   <dbl>               <dbl>
## 1  1985               0.563

peak_year_for_name("Bruce")  # my father      (born 1950s)

## # A tibble: 1 x 2
##    year percentage_for_year
##   <dbl>               <dbl>
## 1  1951               0.383

peak_year_for_name("Mary")   # my mother      (born 1950s)

## # A tibble: 1 x 2
##    year percentage_for_year
##   <dbl>               <dbl>
## 1  1891                4.10

peak_year_for_name("Arlo")   # my 9-year-old  (born 2010s)

## # A tibble: 1 x 2
##    year percentage_for_year
##   <dbl>               <dbl>
## 1  2017              0.0317

peak_year_for_name("Esai")   # my 6-year-old  (born 2010s)

## # A tibble: 1 x 2
##    year percentage_for_year
##   <dbl>               <dbl>
## 1  2002             0.00289

Function signatures

Most functions are designed to work with certain kinds of inputs. For example, name_of_interest in the above should be a quoted text string, not a number, not a data frame, etc.

In some languages, when you write a function, you explicitly declare that it must take a certain kind of input. In R, you don’t do that. R is what’s called a “dynamically typed” language: a function will accept whatever input you give it, and if what it does happens to work for that input (even if it’s not something the author envisioned), it will do it; otherwise you’ll get an error somewhere in the execution of the function.

As I’m sure you’ve seen, this flexibility can sometimes make it difficult to track down what is causing an error in R, and so it is worth trying to avoid this sort of thing by including some documentation at the top of your function indicating what type of input you intend the function to be used with. The user of the function is free to violate that intention, but at least they go in with their eyes open.
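As a minimal sketch of what such documentation might look like, here is a small helper with a comment block describing the intended input and output, plus an early check that stops with a readable message when the input is the wrong type. The function clean_name() and its message text are hypothetical, made up for illustration; they are not part of the lab.

```r
# Hypothetical helper, not part of the lab.
#
# Arguments:
#   name_of_interest: a single character string, e.g. "Colin"
# Returns:
#   the same string with leading and trailing whitespace removed
clean_name <- function(name_of_interest)
{
  # Fail early, with a readable message, if the input is not what we intended
  if (!is.character(name_of_interest) || length(name_of_interest) != 1) {
    stop("name_of_interest must be a single character string")
  }
  trimws(name_of_interest)
}

clean_name("  Colin ")
# clean_name(42) would stop with the message given above
```

The check is optional; the comment block alone already tells the user what you intended.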

In R, you can type formals(my_function_name) to see at least the names of the arguments to a function. For example:

Code:

formals(peak_year_for_name)

## $name_of_interest

We see that peak_year_for_name takes one argument, called name_of_interest.

For functions that are part of an R package, documentation is viewable with the ?function_name syntax (or, equivalently, with help(function_name)). This of course won’t exist for custom functions you’ve written.
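For a function you wrote yourself, formals() and the related args() function are a quick way to check the signature. The sketch below uses a toy function, describe_name(), which is hypothetical and exists only to have one required argument and one argument with a default.

```r
# Toy function (hypothetical) with one required argument and one default
describe_name <- function(name_of_interest, n_years = 1)
{
  paste(name_of_interest, "for", n_years, "year(s)")
}

names(formals(describe_name))    # the argument names, as a character vector
formals(describe_name)$n_years   # the default value of n_years
args(describe_name)              # prints the full signature
```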

Return values

The “value” of a function (the thing it returns, if, for example, you are assigning its result to a variable) is, by default, the return value of the last command executed by the function. In our function there is only one command (which consists of several component commands connected in a pipeline), and so the return value is the return value of the pipeline.

If we wanted to be more explicit, we could assign the result of the pipeline to a variable (we might call it result), and add the line return(result) at the end of our function.

It’s a good idea to do this if your function contains more than one line, to make it clear which part of the function body is the return value. For one-liners (and maybe some very simple multi-liners), it’s a judgment call as to whether it makes it clearer to do this or not.
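To make the difference concrete, here is a small sketch with two toy functions (hypothetical names, not part of the lab) that compute the same percentage, one relying on the implicit return value and one using return().

```r
# Implicit return: the value of the last expression is what comes back
percent_implicit <- function(part, whole)
{
  100 * part / whole
}

# Explicit return: same calculation, with the returned object named
percent_explicit <- function(part, whole)
{
  result <- 100 * part / whole
  return(result)
}

percent_implicit(1, 4)
percent_explicit(1, 4)
```

Both calls give the same value; the second version just makes the returned object easier to spot.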

In “statically typed” languages (the kind where you specify the types that need to be passed in for each argument, like C++, Java, etc.), part of the signature of a function is the type of thing that it returns. In dynamically typed languages (like R, Python, etc.), the type of the return value could well depend on the types of the arguments provided. But, again, it is a good idea to document the intended return type.
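Here is a small sketch of a return type that depends on the input. The function largest() is a hypothetical helper, not part of the lab; it returns the largest element of whatever vector it is given.

```r
# Hypothetical helper: the first element after sorting in decreasing order
largest <- function(x)
{
  sort(x, decreasing = TRUE)[1]
}

class(largest(c(3, 10, 7)))          # a numeric vector in, a number out
class(largest(c("Bruce", "Mary")))   # a character vector in, a string out
```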

Default arguments

Often, we want to allow our functions to be flexible, by allowing the user to alter several aspects of what they do. We make our functions more flexible by adding more inputs, each of which constitutes a “degree of freedom” for our function. But if most use cases involve sensible defaults, it is cumbersome to force the user to input these defaults every time they use the function.

We can have the “best of both worlds” (flexibility without cumbersome function calls) by using default argument values.

For example, I could make my peak_year_for_name function more flexible by having the function return the most popular n years:

Code:

peak_years_for_name <- function(name_of_interest, n_years) 
{
  total_births_by_year <- babynames %>%
    group_by(year) %>%
    summarize(total_births = sum(n))
  babynames %>%
    filter(name == name_of_interest) %>%
    group_by(year) %>%
    summarize(total_for_name = sum(n)) %>%
    left_join(total_births_by_year) %>%
    mutate(percentage_for_year = 100 * total_for_name / total_births) %>%
    slice_max(order_by = percentage_for_year, n = n_years) %>%
    select(year, percentage_for_year)    
}

As written, this function now requires the user to specify a number of years. The following will produce an error, since I haven’t supplied the n_years argument.

Code:

peak_years_for_name("Colin")

If I think that the user will most often just want to see the single most popular year, I can give that second argument a default value, so that the call above works as before.

Function (Re-)definition:

peak_years_for_name <- function(name_of_interest, n_years = 1) 
{
  total_births_by_year <- babynames %>%
    group_by(year) %>%
    summarize(total_births = sum(n))
  babynames %>%
    filter(name == name_of_interest) %>%
    group_by(year) %>%
    summarize(total_for_name = sum(n)) %>%
    left_join(total_births_by_year) %>%
    mutate(percentage_for_year = 100 * total_for_name / total_births) %>%
    slice_max(order_by = percentage_for_year, n = n_years) %>%
    select(year, percentage_for_year)  
}

Some function calls:

peak_years_for_name("Colin")                # Use the default

## # A tibble: 1 x 2
##    year percentage_for_year
##   <dbl>               <dbl>
## 1  2004               0.135

peak_years_for_name("Colin", 5)             # Override the default

## # A tibble: 5 x 2
##    year percentage_for_year
##   <dbl>               <dbl>
## 1  2004              0.135 
## 2  2003              0.129 
## 3  2005              0.118 
## 4  2006              0.0977
## 5  2009              0.0958

peak_years_for_name("Colin", n_years = 5)   # Override the default, using the arg name

## # A tibble: 5 x 2
##    year percentage_for_year
##   <dbl>               <dbl>
## 1  2004              0.135 
## 2  2003              0.129 
## 3  2005              0.118 
## 4  2006              0.0977
## 5  2009              0.0958
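The missing-argument error and the effect of a default can both be reproduced without the baby names data. The sketch below uses two toy functions (hypothetical names) that mirror the two versions of peak_years_for_name, and uses tryCatch() to capture the error message so that the script keeps running.

```r
# Toy functions (hypothetical): return the first n_items elements of x
first_n_required <- function(x, n_items)
{
  x[seq_len(n_items)]
}
first_n_default <- function(x, n_items = 1)
{
  x[seq_len(n_items)]
}

# Capture the error message instead of halting the script
msg <- tryCatch(
  first_n_required(letters),
  error = function(e) conditionMessage(e))
msg

first_n_default(letters)                # uses the default, n_items = 1
first_n_default(letters, n_items = 3)   # overrides the default by name
```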

Function scope

You might have noticed that in our function we hardcoded the dataset to be babynames. If we had tried to call this function without having run library(babynames) above, we’d get an error, since babynames would not then be defined. If you “unload” the babynames package (undoing the effect of library()) and then try to call the function, R will complain that the babynames data doesn’t exist.

Code:

rm(babynames) # remove the dataset from the environment
detach("package:babynames", unload = TRUE) # unload the package
peak_years_for_name("Colin")  ## Error: babynames can no longer be found

Someone calling this function would have no easy way of knowing why this happened, since their function call didn’t refer to that dataset; for that reason among others, it’s not great coding practice to hardcode things inside a function like that.

(Let’s make sure to bring back the babynames package and data for later.)

Code:

library(babynames)
data(babynames)

How does R know where to look for definitions of things that are referenced in a function? A complete answer would involve a lot of caveats, but for the most part, R will first look inside the function for a definition (at its arguments, and at anything that is created within the function itself), and if it doesn’t find anything, it will look in the “global” environment (that is, at stuff that was defined or loaded into the environment by previous assignments or calls).
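This lookup order can be sketched with a single variable that exists both outside and inside a function. All the names below are hypothetical and chosen only for illustration.

```r
scale_factor <- 100   # defined outside any function

# No local scale_factor here, so R falls back to the one defined outside
to_percent_outer <- function(p)
{
  p * scale_factor
}

# A local variable with the same name takes priority inside the function
to_percent_local <- function(p)
{
  scale_factor <- 1000
  p * scale_factor
}

to_percent_outer(0.5)   # uses the outer value
to_percent_local(0.5)   # uses the local value
scale_factor            # the outer value is unchanged by the local assignment
```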

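We could instead add a dataset argument to our function so that it doesn’t depend on something defined in the global environment. (Note that the definition below reuses the name peak_year_for_name(), replacing the earlier one-argument version, while keeping the n_years argument from peak_years_for_name().)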

Code:

peak_year_for_name <- function(dataset, name_of_interest, n_years = 1) 
{
  total_births_by_year <- dataset %>%
    group_by(year) %>%
    summarize(total_births = sum(n))
  dataset %>%
    filter(name == name_of_interest) %>%
    group_by(year) %>%
    summarize(total_for_name = sum(n)) %>%
    left_join(total_births_by_year) %>%
    mutate(percentage_for_year = 100 * total_for_name / total_births) %>%
    slice_max(order_by = percentage_for_year, n = n_years) %>%
    select(year, percentage_for_year)    
}

Notice that we still have hardcoded variable names here, so this function will only work if the dataset we provide has the right columns, but this can be useful if we are going to work with (say) different subsets of a dataset that we obtain by filter()ing:

babynames %>%
  filter(year > 1920 & year < 1999) %>%
  peak_year_for_name(name_of_interest = "Mary")

## # A tibble: 1 x 2
##    year percentage_for_year
##   <dbl>               <dbl>
## 1  1921                3.18

Exercises on Functions

The following exercises involve writing functions designed to tell us things about flights, using the nycflights13 package. Load it first:

library(nycflights13)

Recall that this package provides the dataset flights about individual flights, and the datasets airports and planes about… those things (as well as a couple others).

Write a function called top_n_destinations that takes a dataset argument, a carrierID argument, an origin_airport argument, and an n_destinations argument. It should retrieve the n_destinations most common destinations (dest) of flights operated by the carrier given in carrierID and taking off from the airport whose code is given in origin_airport, along with how often the carrier flew to each.

SOLUTION

top_n_destinations <- function(dataset, carrierID, origin_airport, n_destinations)
{
  dataset %>% 
    filter(
      carrier == carrierID,
      origin  == origin_airport
    ) %>%
    group_by(dest) %>%
    summarize(n_flights = n()) %>%
    slice_max(order_by = n_flights, n = n_destinations)
}

Use your function to find the top five destinations for Delta Air Lines (DL) flights from JFK (one of the three New York City area airports covered by the data) using the flights dataset.

SOLUTION

Use your function to find the top five destinations for American Airlines (AA) flights from JFK.

SOLUTION

Write and test a function that, given a dataset, an origin_code, a destination_code (e.g. JFK to LAX), and a number n_carriers, will retrieve the n_carriers carriers with the most flights from the origin to the destination, along with the average arrival delay time for those flights.

SOLUTION

top_carriers_with_delays <- function(
  dataset, 
  origin_code, 
  destination_code,
  n_carriers = 1)
{
  dataset %>%
    filter(
      origin == origin_code,
      dest   == destination_code) %>%
    group_by(carrier) %>%
    summarize(
      n_flights = n(),
      avg_arrival_delay = mean(arr_delay, na.rm = TRUE)) %>%
    slice_max(
      order_by = n_flights,
      n        = n_carriers
    )
}

flights %>%
  top_carriers_with_delays(
    origin_code      = "JFK", 
    destination_code = "LAX",
    n_carriers       = 5)

## # A tibble: 5 x 3
##   carrier n_flights avg_arrival_delay
##   <chr>       <int>             <dbl>
## 1 AA           3217             -1.93
## 2 DL           2501             -3.85
## 3 UA           2059              1.59
## 4 VX           1797              2.10
## 5 B6           1688              2.01

Iteration

Computers are excellent at repetition, as long as you tell them precisely what to repeat.

Remember the example above where I called my function on a bunch of names of people in my family? I can make that even more efficient by creating the list of names I’m interested in up front, and then telling the computer “Call this function on each one of these names, and return the results”.

In R, the lapply() function (short for “list apply”) is useful for this sort of thing, provided the values I want to iterate over go with the first argument of my function that I don’t supply by name.

Code:

my_name_list <- c("Colin", "Megan", "Bruce", "Mary", "Arlo", "Esai")
lapply(my_name_list, FUN = peak_year_for_name, dataset = babynames)

## [[1]]
## # A tibble: 1 x 2
##    year percentage_for_year
##   <dbl>               <dbl>
## 1  2004               0.135
## 
## [[2]]
## # A tibble: 1 x 2
##    year percentage_for_year
##   <dbl>               <dbl>
## 1  1985               0.563
## 
## [[3]]
## # A tibble: 1 x 2
##    year percentage_for_year
##   <dbl>               <dbl>
## 1  1951               0.383
## 
## [[4]]
## # A tibble: 1 x 2
##    year percentage_for_year
##   <dbl>               <dbl>
## 1  1891                4.10
## 
## [[5]]
## # A tibble: 1 x 2
##    year percentage_for_year
##   <dbl>               <dbl>
## 1  2017              0.0317
## 
## [[6]]
## # A tibble: 1 x 2
##    year percentage_for_year
##   <dbl>               <dbl>
## 1  2002             0.00289

Notice that the argument I wanted to vary from one call to the next went in the X position for lapply(), whereas the argument(s) that stayed constant were provided to lapply() using their names.

Now, this result is a bit inelegant: each call returns a data frame with a single row, and we get a list of six of them. Wouldn’t it be nice if we could “stack” these into a single data frame?

We can! The bind_rows() function will do this for us. Examine the results of the following code after each step, to make sure you understand what’s happening.

my_name_list %>% 
  lapply(FUN = peak_year_for_name, dataset = babynames) %>%
  bind_rows() %>%
  mutate(name = my_name_list) %>%
  select(name, year, percentage_for_year)

## # A tibble: 6 x 3
##   name   year percentage_for_year
##   <chr> <dbl>               <dbl>
## 1 Colin  2004             0.135  
## 2 Megan  1985             0.563  
## 3 Bruce  1951             0.383  
## 4 Mary   1891             4.10   
## 5 Arlo   2017             0.0317 
## 6 Esai   2002             0.00289

(This time I passed lapply() its first (X) argument via a pipe, but it would have been equivalent to put my_name_list first inside the parentheses instead.)
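The way lapply() matches arguments is easier to see on a smaller example. In the sketch below, label_value() is a hypothetical toy function whose first argument is not the one being iterated over, much like dataset in peak_year_for_name.

```r
# Toy function (hypothetical): the first argument is held constant
label_value <- function(prefix, value)
{
  paste0(prefix, value)
}

# prefix is supplied by name, so each element of X fills the next
# unmatched argument, which is value
labelled <- lapply(c("a", "b", "c"), FUN = label_value, prefix = "item_")
labelled
```

The result is a list with one element per input value, in the same order as the input.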

Loops, and Alternatives to Loops

If you have programmed in another language before, you likely would have handled something like this using a “loop” such as a for loop. You can write for loops in R, but it is more “idiomatic” to use the above sort of “apply” construct; and in certain cases it’s more efficient too (which is important when there are a lot of iterations involved).

If you find yourself wanting a for loop, ask yourself whether you could handle what you wanted to do with a function whose first argument is the thing you want to iterate over.
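As a rough side-by-side sketch, here is the same small task (counting the characters in each name) written first as a for loop and then with lapply(). The variable names are made up for illustration.

```r
names_to_check <- c("Colin", "Megan", "Bruce")

# Version 1: a for loop that fills in a pre-allocated vector
lengths_loop <- integer(length(names_to_check))
for (i in seq_along(names_to_check)) {
  lengths_loop[i] <- nchar(names_to_check[i])
}

# Version 2: lapply() calls nchar() on each element and returns a list,
# which unlist() flattens back into a vector
lengths_apply <- unlist(lapply(names_to_check, nchar))

identical(lengths_loop, lengths_apply)
```

The loop needs bookkeeping (an index and a container for the results) that the apply version handles for us.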

Use lapply() and the function that you wrote in Exercise 1 to find the top airport destination from JFK for Delta (DL), American (AA), and United (UA).

SOLUTION

airlines <- c(Delta = "DL", American = "AA", United = "UA")
airlines %>% lapply(
  FUN            = top_n_destinations, 
  dataset        = flights, 
  origin_airport = "JFK",
  n_destinations = 1) %>%
  bind_rows(.id = "airline") %>%
  select(airline, dest, n_flights)

## # A tibble: 3 x 3
##   airline  dest  n_flights
##   <chr>    <chr>     <int>
## 1 Delta    LAX        2501
## 2 American LAX        3217
## 3 United   SFO        2475

Use lapply() and the function that you wrote in Exercise 4 to find the carriers with the most flights from JFK to Chicago O’Hare (ORD), Los Angeles International (LAX), and San Francisco International (SFO) airports, respectively. In order to have the destination shown in the output, you will need to do a couple of things: First, the vector of destinations that you pass to lapply() should have named entries (it should be in the form c(name1 = value1, name2 = value2, ...), where name1, name2, etc are the labels you want displayed in the output, and value1, value2, etc are the actual argument values you’re passing to your function). Second, set .id = "destination" as an argument to bind_rows() so that the output includes a column called destination that contains name1, name2, etc.

SOLUTION

Applying a function to a grouped data frame

The following function computes the n_returned most popular names in the dataset passed to it via the dataset argument:

Code:

top_n_names <- function(dataset, n_returned) 
{
  overall_total <- dataset %>%
    summarize(total_births = sum(n)) %>%
    pull(total_births)
  dataset %>%
    group_by(name) %>%
    summarize(
      total_for_name   = sum(n),
      percent_for_name = total_for_name / overall_total * 100) %>%
    slice_max(
      order_by = total_for_name, 
      n        = n_returned) %>%
    rownames_to_column(var = "rank")
}

Here we use it to find the top 10 names for babies born in 2000.

babynames %>%
  filter(year == 2000) %>%
  top_n_names(n_returned = 10)

## # A tibble: 10 x 4
##    rank  name        total_for_name percent_for_name
##    <chr> <chr>                <int>            <dbl>
##  1 1     Jacob                34530            0.914
##  2 2     Michael              32149            0.851
##  3 3     Matthew              28616            0.757
##  4 4     Joshua               27592            0.730
##  5 5     Emily                25983            0.688
##  6 6     Christopher          24981            0.661
##  7 7     Nicholas             24691            0.654
##  8 8     Andrew               23684            0.627
##  9 9     Hannah               23105            0.612
## 10 10    Joseph               22849            0.605

If we want to apply this function to find the most popular name in a particular decade, we could simply filter our data to keep only years in the range of interest, and call the function on the filtered data.

But suppose we want to do this for every decade in the 20th century. We could theoretically create 10 datasets, put them in a list, and use lapply on the list of datasets. But it’s simpler to take advantage of the do() function for this. This is seen most easily by example:

Code:

## The floor() function rounds down to the nearest integer
top_by_decade <- babynames %>%
  mutate(decade = 10 * floor(year / 10)) %>%
  group_by(decade) %>%
  do(
    top_n_names(
      dataset    = ., 
      n_returned = 10))
## The period (.) in the do() call stands for the data frame of rows in each group
top_by_decade

## # A tibble: 140 x 5
## # Groups:   decade [14]
##    decade rank  name    total_for_name percent_for_name
##     <dbl> <chr> <chr>            <int>            <dbl>
##  1   1880 1     Mary             92030             3.82
##  2   1880 2     John             90395             3.75
##  3   1880 3     William          85246             3.54
##  4   1880 4     James            54323             2.26
##  5   1880 5     George           47980             1.99
##  6   1880 6     Charles          46879             1.95
##  7   1880 7     Anna             38320             1.59
##  8   1880 8     Frank            31135             1.29
##  9   1880 9     Joseph           26404             1.10
## 10   1880 10    Emma             25512             1.06
## # … with 130 more rows

Note that since top_n_names() returns a data frame with n_returned rows (for whatever value of n_returned we supply when we call the function), the result of this operation is a big “stacked” data frame with n_returned names per decade.
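In dplyr 1.0.0 and later, do() is marked as superseded, and group_modify() is one alternative that follows the same pattern. The sketch below is an illustration under assumptions, not part of the original lab: toy_names is a tiny made-up dataset with the same column names as babynames, and top_names_simple() is a hypothetical stand-in for top_n_names().

```r
library(dplyr)

# Tiny made-up dataset with the same column names as babynames
toy_names <- data.frame(
  year = c(1990, 1990, 1995, 2001, 2001, 2004),
  name = c("Ann", "Bob", "Ann", "Ann", "Bob", "Bob"),
  n    = c(10, 20, 30, 5, 15, 25))

# Hypothetical stand-in for top_n_names(): the most common names in a dataset
top_names_simple <- function(dataset, n_returned)
{
  dataset %>%
    group_by(name) %>%
    summarize(total_for_name = sum(n)) %>%
    slice_max(order_by = total_for_name, n = n_returned)
}

# In group_modify(), .x plays the role that the period plays in do():
# it stands for the rows belonging to the current group
top_by_decade_toy <- toy_names %>%
  mutate(decade = 10 * floor(year / 10)) %>%
  group_by(decade) %>%
  group_modify(~ top_names_simple(.x, n_returned = 1))
top_by_decade_toy
```

As with do(), the result is one stacked data frame, with the grouping column (decade) added back alongside whatever the function returned for each group.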

Note: If you have worked with the mosaic package, you likely used another function called do(). It’s related to the dplyr one, but not identical, so if you are working in an R session with both packages loaded, it’s a good idea to be explicit about which one you want to be using. You can do this by writing either dplyr::do() or mosaic::do().

Use do() with your top_n_destinations function from Exercise 1 to find the top destination for each airline’s flights from JFK in each month of 2013. Since carrierID is an input to top_n_destinations, you’ll probably want to pull the carrier column from each grouped dataset using pull(., carrier) in your call to top_n_destinations.

STAT 209: Lab 11

Colin Reimer Dawson

July 15, 2021
