Integrating SQL with R and ggplot2

SQL vs dplyr

Now that you’re on your way to knowing two different data wrangling languages (dplyr and SQL), it’s worth spending a minute thinking about their relative strengths and weaknesses. Here are a few strengths of each:

dplyr

  • Runs on the same computer you’re working on
  • Has a “functional programming” style: Each wrangling verb has a clear input and output (whether this is a strength depends on your attitude toward functional programming, I suppose)
  • Is integrated into the larger tidyverse
  • Is more flexible: can combine standard verbs with arbitrary R code, incorporate code in custom functions, etc

SQL

  • Is more efficient for large datasets
  • Code reads (sort of) like English sentences (this is also a matter of taste: SQL is arguably ‘easier on the eyes’ than R code, but I find it harder to pull apart than R’s functional style; your mileage may vary)
  • Is more widely used in industry outside data science specifically (though R and the tidyverse have gotten far more widely used in the last few years)

There are two big reasons we can’t rely exclusively on SQL, however:

  • It doesn’t support plotting
  • It doesn’t support statistical modeling

So if we want to do more than show some tables, we need to be able to pass the results of our SQL queries back into R, so we can create graphs and (though we’re not focusing on this in this class) models.

Goal of this lab

Use SQL together with ggplot to produce visualizations from large datasets.

In particular, we will try to verify the following claim from the FiveThirtyEight article here:

“In 2014, the 6 million domestic flights the U.S. government tracked required an extra 80 million minutes to reach their destinations. The majority of flights – 54 percent – arrived ahead of schedule in 2014. (The 80 million minutes figure cited earlier is a net number. It consists of about 115 million minutes of delays minus 35 million minutes saved from early arrivals.)”

as well as to reproduce the graphic therein (shown below).

Figure: Reproduced from MDSR p. 291

Setup Boilerplate and Reference (Repeated from Labs 13-14)

We need to load packages and set up a connection to the server again.

Code:
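The exact packages and connection details depend on the course server, so treat this as a sketch: it assumes the RMariaDB driver and uses placeholder credentials that you should replace with the ones provided for the class database.

```r
library(tidyverse)    # dplyr, tidyr, ggplot2, ...
library(DBI)          # database interface: dbConnect(), dbGetQuery()
library(RMariaDB)     # driver for MySQL/MariaDB servers

# Placeholder connection details -- substitute the host, user, password, and
# database name for the course server.
db <- dbConnect(
  MariaDB(),
  dbname   = "airlines",
  host     = "your-database-host",
  user     = "your-username",
  password = "your-password"
)
```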

For convenience here is the list of basic verbs again:

Image Source: Baumer et al. Modern Data Science with R.

Remember: Verbs lower in the list must appear after verbs higher in the list when constructing queries.

Here’s the dplyr to SQL translation summary again:

Image Source: Baumer et al. Modern Data Science with R.

And, it bears repeating, in all caps:

IMPORTANT: ALWAYS LIMIT YOUR QUERIES, LEST YOU TRY TO FETCH HUNDREDS OF MILLIONS OF RECORDS AND BREAK EVERYTHING FOR EVERYONE!

One last reminder: to designate a code chunk “SQL”, use {sql connection=db} in the chunk options (where db is whatever you named your connection in a previous R code chunk) in place of the r that is usually there.
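For example, a chunk whose header reads {sql connection=db} and whose body contains the following would preview the carriers table (which we will use later in this lab):

```sql
SELECT carrier, name
FROM carriers
LIMIT 5;
```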

Extracting the necessary summary statistics

  1. Before diving into writing any queries, examine the quote and the graph with your group and work together to sketch out what the summarized data table(s) should look like to verify the quote and reproduce the graph. Note that flights are only classified as a “late arrival” if the delay is 15 minutes or more. What are the rows, what columns are needed, which of these correspond to columns in the flights data, etc.? Remember you can use DESCRIBE in an SQL chunk to see an overview of the columns. For columns that represent summary statistics, how can these be computed? Write down the computation steps in sentences, not in code yet. Working backwards like this can be a good way to approach a wrangling+visualization problem.

OUTLINE OF APPROACH

The quote references several summary statistics, which are aggregated over the entire dataset:

  • there were 6 million flights altogether
  • the total net(?) delay was 80 million minutes
  • this net delay involved 115 million minutes from late flights
  • and -35 million minutes from early flights
  • 54% of flights arrived early

Since these statistics describe the dataset as a whole, it seems that the relevant summary table will have only one row, with the following columns:

  • Number of flights
  • Total number of minutes of delay from late flights
  • Total number of minutes saved from early flights
  • Total number (or %) of flights that were early

Since the cases in flights are flights, the total number of flights can be obtained by counting records. We will need to

  • create a binary variable that encodes for each flight whether arr_delay < 0 (that is, whether the flight arrived early)
  • compute the proportion of the time that binary variable is “true”.
  • sum the arr_delay variable for flights where arr_delay > 0, and separately for flights where arr_delay < 0, as well as summing it overall.

For the graph, we could potentially create either a summarized data table in which cases are airlines, or a flight-level data table in which the summarizing happens when the bar graph is created. Since this is a very large dataset, we will want to do as much summarizing as we can on the SQL side before we bring data into memory as an R object to be visualized, so the first representation is likely preferable. The summary table should be structured as follows:

  • Each row should represent an airline.
  • We will need a column encoding the airline’s name
  • We will need a column encoding the percentage of flights that were delayed 15-119 minutes, and a column encoding the percentage of flights delayed 120 minutes or more.
  • For ggplot purposes, we probably want to keep all of the percentages in a single column, with another column indicating whether the percentage corresponds to a “short” or a “long” delay (that is, we want to “pivot” the data into a “long” format).

  2. Take the flights dataset, and write an SQL query to calculate the necessary summary information. To save computing strain, just use the flights that took place on your birthday in 2014, rather than all of 2014, and multiply each variable (that isn’t a percentage) by 365 to extrapolate to the whole year (there are of course seasonal differences in flight delays, so this is not a representative sample, and we shouldn’t expect results extremely similar to the FiveThirtyEight article’s). A tip: you can use commands of the form if(<condition>, <value if true>, <value if false>) in SQL to create a variable that has one value if a condition is met and another value if it isn’t (the angle brackets indicate placeholders, and shouldn’t actually be typed). Note: The result should just be one row, but include a LIMIT clause anyway, just in case the result isn’t what you intend.
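Here is a sketch of the general shape such a query can take. The date is a placeholder (use your own birthday), the exact definition of “early” is up to you, and you may want to rescale the minute totals (e.g., to millions of minutes) to compare with the article.

```sql
-- Sketch only: placeholder date; IF() plays the role of if_else() in dplyr.
SELECT
  365 * COUNT(*)                              AS num_flights,
  AVG(IF(arr_delay < 0, 1, 0))                AS early_pct,
  365 * SUM(IF(arr_delay > 0, arr_delay, 0))  AS min_late,
  365 * SUM(IF(arr_delay < 0, arr_delay, 0))  AS min_early,
  365 * SUM(arr_delay)                        AS net_delay
FROM flights
WHERE year = 2014 AND month = 7 AND day = 4
LIMIT 1;
```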

SOLUTION
1 record

num_flights   early_pct   min_late   min_early   net_delay
    5308560      0.2771   203.9667     -15.169    188.7977

My birthday is in the winter so there are more delays, but even for less travel-challenged times of year, you should notice that the numbers still don’t quite match what’s in the quote.

The total minutes early come close, but the total minutes late is way under what FiveThirtyEight reports. It turns out, when you read FiveThirtyEight’s methodology, that cancelled flights have arr_delay = 0 in the data, and so these aren’t contributing to the statistics we’ve computed; but these flights obviously hold travelers up.

FiveThirtyEight did some modeling to estimate an arr_delay number for cancelled flights; hence the discrepancy. We won’t try to reproduce what they did; instead as an approximation, we will consider cancelled flights to be delayed by 4.5 hours (following another quote in the article suggesting a “quick and dirty” estimate of 4-5 hours for each cancelled flight).

  3. Revise your query to add 270 minutes of delay for each cancelled flight. Hint: Cancelled flights show an arrival delay of 0 as-is, so you can modify the if() statement that adds up the minutes late, tallying 270 for a cancelled flight instead of zero.
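A sketch of the revision, assuming the flights table has a 0/1 cancelled column (you can confirm the column name with DESCRIBE flights); only the minutes-late term needs to change:

```sql
-- Sketch: as before, but a cancelled flight now contributes 270 minutes of
-- delay instead of its recorded arr_delay of 0.
SELECT
  365 * COUNT(*)                                    AS num_flights,
  AVG(IF(arr_delay < 0, 1, 0))                      AS early_pct,
  365 * SUM(IF(cancelled = 1, 270,
               IF(arr_delay > 0, arr_delay, 0)))    AS min_late,
  365 * SUM(IF(arr_delay < 0, arr_delay, 0))        AS min_early,
  365 * SUM(arr_delay)                              AS net_delay
FROM flights
WHERE year = 2014 AND month = 7 AND day = 4
LIMIT 1;
```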

SOLUTION
1 record

num_flights   early_pct   min_late   min_early   net_delay
    5308560      0.2771   324.8876     -15.169    188.7977

Now let’s create the dataset for the graph. We’re going to need to pull the results into R, but let’s first write the query in SQL to confirm that we get what we want.

  4. Write an SQL query to produce a data table with the summary statistics needed for the graph. Don’t worry about the pivot_longer() step that we’ll need to get the data into the “tidy” format; we’ll do that in R later. Hint: for the labels, you’ll want the name field from the carriers data table. Note that the graph is sorted in descending order by percentage of short delays.
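A sketch of the shape such a query might take (placeholder date again; the thresholds follow the lab’s definitions of a 15–119 minute “short” delay and a 120+ minute “long” delay, and here cancelled flights are counted as long delays via the 270-minute rule, which is a judgment call you might make differently):

```sql
SELECT
  f.carrier,
  c.name,
  AVG(IF(f.arr_delay >= 15 AND f.arr_delay < 120, 1, 0))  AS short_delay_pct,
  AVG(IF(f.arr_delay >= 120 OR f.cancelled = 1, 1, 0))    AS long_delay_pct
FROM flights AS f
JOIN carriers AS c ON f.carrier = c.carrier
WHERE f.year = 2014 AND f.month = 7 AND f.day = 4
GROUP BY f.carrier, c.name
ORDER BY short_delay_pct DESC
LIMIT 20;
```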

SOLUTION
Displaying records 1 - 10

carrier  name                          short_delay_pct  long_delay_pct
FL       AirTran Airways Corporation            0.5418          0.0605
WN       Southwest Airlines Co.                 0.5392          0.2330
F9       Frontier Airlines Inc.                 0.4128          0.5174
UA       United Air Lines Inc.                  0.4035          0.2326
OO       SkyWest Airlines Inc.                  0.3306          0.2141
US       US Airways Inc.                        0.3292          0.0396
MQ       Envoy Air                              0.3208          0.1615
VX       Virgin America                         0.3195          0.0888
DL       Delta Air Lines Inc.                   0.2931          0.0648
EV       ExpressJet Airlines Inc.               0.2808          0.2601

Bringing the result set into R

We’re now done with the SQL part of the process!

Now that we have a small dataset, we can turn it into an R data frame and do our finishing wrangling touches in dplyr and our visualization in ggplot2.

  5. Once you have your SQL query working, create an R object that contains the text of the query as one big quoted string (you can include the line breaks). You can call it whatever you want; I will refer to this variable below as query.
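For instance (using the per-carrier sketch from above as the query text; substitute your own working query):

```r
# Store the text of the SQL query as a single string; line breaks are fine.
query <- "
  SELECT f.carrier, c.name,
         AVG(IF(f.arr_delay >= 15 AND f.arr_delay < 120, 1, 0)) AS short_delay_pct,
         AVG(IF(f.arr_delay >= 120 OR f.cancelled = 1, 1, 0))   AS long_delay_pct
  FROM flights AS f
  JOIN carriers AS c ON f.carrier = c.carrier
  WHERE f.year = 2014 AND f.month = 7 AND f.day = 4
  GROUP BY f.carrier, c.name
  ORDER BY short_delay_pct DESC
  LIMIT 20;
"
```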

SOLUTION

  6. Having defined the query string, create an R data frame that contains the relevant information with db %>% dbGetQuery(query) %>% collect(). The use of collect() here brings the actual data from the table into memory, rather than just a pointer to a “view” of the data on the remote server, so don’t do this until you know that your query produces the small result set you want.
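In code (the object name is arbitrary; the sketches below call it delays_by_carrier):

```r
# Run the query against the connection and pull the (small) result set into
# memory as an ordinary data frame.
delays_by_carrier <- db %>%
  dbGetQuery(query) %>%
  collect()

delays_by_carrier
```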

SOLUTION
##    carrier                        name short_delay_pct long_delay_pct
## 1       FL AirTran Airways Corporation          0.5418         0.0605
## 2       WN      Southwest Airlines Co.          0.5392         0.2330
## 3       F9      Frontier Airlines Inc.          0.4128         0.5174
## 4       UA       United Air Lines Inc.          0.4035         0.2326
## 5       OO       SkyWest Airlines Inc.          0.3306         0.2141
## 6       US             US Airways Inc.          0.3292         0.0396
## 7       MQ                   Envoy Air          0.3208         0.1615
## 8       VX              Virgin America          0.3195         0.0888
## 9       DL        Delta Air Lines Inc.          0.2931         0.0648
## 10      EV    ExpressJet Airlines Inc.          0.2808         0.2601
## 11      AA      American Airlines Inc.          0.2647         0.0487
## 12      B6             JetBlue Airways          0.2535         0.6416
## 13      AS        Alaska Airlines Inc.          0.1480         0.0072
## 14      HA      Hawaiian Airlines Inc.          0.0860         0.0054

Cleaning up the labels

Getting the airline names to display as they are in the graph will require some text manipulation. For one thing, we want to strip away the formal company identifiers like Co. and Inc. Moreover, we don’t really need the Airlines and Airways bits.

We can use the gsub() function to find these substrings and replace them. Its syntax uses what are called “regular expressions”: essentially a pattern-matching language that lets us define a variety of more or less general “wildcards”. Here is the kind of gsub() command we want.
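For instance, assuming the result set is stored in a data frame called delays_by_carrier as above, something along these lines does the job (a sketch; your exact patterns may differ):

```r
# Drop " Airlines", " Airways", and " Air Lines", then the corporate suffixes,
# from the name column.
delays_by_carrier <- delays_by_carrier %>%
  mutate(
    name = gsub(" (Airlines|Airways|Air Lines)", "", name),
    name = gsub(" (Inc\\.|Co\\.|Corporation)", "", name)
  )
```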

##    carrier           name short_delay_pct long_delay_pct
## 1       FL        AirTran          0.5418         0.0605
## 2       WN      Southwest          0.5392         0.2330
## 3       F9       Frontier          0.4128         0.5174
## 4       UA         United          0.4035         0.2326
## 5       OO        SkyWest          0.3306         0.2141
## 6       US             US          0.3292         0.0396
## 7       MQ      Envoy Air          0.3208         0.1615
## 8       VX Virgin America          0.3195         0.0888
## 9       DL          Delta          0.2931         0.0648
## 10      EV     ExpressJet          0.2808         0.2601
## 11      AA       American          0.2647         0.0487
## 12      B6        JetBlue          0.2535         0.6416
## 13      AS         Alaska          0.1480         0.0072
## 14      HA       Hawaiian          0.0860         0.0054

You’ll notice that you get more airlines than appear in the graph. We could do some consolidation, but we’ll set this aside for now.

Pivoting the data to “long” form

For visualizing this data, we’ll want to create a single pct column representing percentages of both delay categories, and a delay_type column indicating whether the percentage refers to short or long delays. This is a job for pivot_longer().

Remember that pivot_longer() has a cols= argument where we specify the names of the columns we’re “stacking”, a names_to= argument where we specify a name for the new grouping variable, and a values_to= argument where we specify a name for the new merged quantitative variable.
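Schematically, using the column names from the sketches above (the names_pattern= argument strips the _delay_pct suffix so that delay_type holds just “short” or “long”):

```r
# Stack the two percentage columns into a single pct_of_type column, recording
# which kind of delay each value describes.
delays_long <- delays_by_carrier %>%
  pivot_longer(
    cols = c(short_delay_pct, long_delay_pct),
    names_to = "delay_type",
    names_pattern = "(.*)_delay_pct",   # "short_delay_pct" -> "short", etc.
    values_to = "pct_of_type"
  )
```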

  7. Execute an appropriate pivot_longer() to obtain a dataset that has carrier, name, delay_type (short or long), and pct_of_type.

SOLUTION
## # A tibble: 28 x 4
##    carrier name       delay_type pct_of_type
##    <chr>   <fct>      <chr>            <dbl>
##  1 HA      Hawaiian   short           0.086 
##  2 HA      Hawaiian   long            0.0054
##  3 AS      Alaska     short           0.148 
##  4 AS      Alaska     long            0.0072
##  5 B6      JetBlue    short           0.254 
##  6 B6      JetBlue    long            0.642 
##  7 AA      American   short           0.265 
##  8 AA      American   long            0.0487
##  9 EV      ExpressJet short           0.281 
## 10 EV      ExpressJet long            0.260 
## # … with 18 more rows

Creating the graph

Most of the work is done now; all that remains is to actually produce the bar graph. Some tips here: don’t forget to use stat="identity" with geom_bar(), since we have precomputed the percentages.

You may need to reorder the label variable so the airlines appear in descending order of short delays. You can do this with reorder(<variable to be reordered>, <variable to define the ordering>, max) (don’t type the angle brackets). This applies the max function to the values of the <variable to define the ordering> within each level of the <variable to be reordered>, and then sorts the levels by those summary values (ascending by default, which puts the airlines with the largest values at the top once the coordinates are flipped).

  8. Construct the barplot with the bars oriented vertically first, then use + coord_flip() in your ggplot chain to turn it sideways. (Note: To get the percentages to display on the bars as they are in the FiveThirtyEight graph, we can use geom_text(). However, it’s a bit awkward to specify the “y” coordinate (which will become the “x” coordinate after flipping the coordinates) for the labels from the long data, since the labels for the longer delays are positioned based on the percentage of short delays. So, as a hack, you may want to join the long and wide forms of the data so you have the percent of short delays in every row.) Use scale_y_continuous() to set the limits on the “y” axis. Load the ggthemes library and use theme_fivethirtyeight() to get closer to the aesthetic they use.
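Here is a sketch of one way to put this together, using the delays_long data frame from the sketch above. For label placement it uses position_stack() rather than the join-the-wide-data hack described in the exercise; either approach works.

```r
library(ggthemes)   # for theme_fivethirtyeight()

delays_long %>%
  # Order airlines by their short-delay percentage: the reorder() trick from
  # the tip above, applied only to the short-delay rows (the other rows
  # contribute 0, so max() picks out each airline's short-delay percentage).
  mutate(name = reorder(name, ifelse(delay_type == "short", pct_of_type, 0), max)) %>%
  ggplot(aes(x = name, y = pct_of_type, fill = delay_type)) +
  geom_bar(stat = "identity") +
  geom_text(
    aes(label = scales::percent(pct_of_type, accuracy = 1)),
    position = position_stack(vjust = 0.5),
    size = 3
  ) +
  scale_y_continuous(limits = c(0, 1)) +
  coord_flip() +
  theme_fivethirtyeight()
```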

SOLUTION


  9. Post your graph (not the code) to #lab15 in Slack.