dplyr and SQL

Now that you know (or are on your way to knowing) two different data wrangling languages (dplyr and SQL), it's worth spending a minute thinking about their relative strengths and weaknesses (dplyr, for what it's worth, is much more popular now than it was even three or four years ago). There are two big reasons we can't rely exclusively on SQL, however: (1) it doesn't support plotting, and (2) it doesn't support statistical modeling. So if we want to do more than show some tables, we need to be able to pass the results of our SQL queries back into R, so we can create graphs and (though we're not focusing on this in this class) models.
Use SQL together with ggplot to produce visualizations from large datasets.
In particular, we will try to verify the following claim from the FiveThirtyEight article here:
“In 2014, the 6 million domestic flights the U.S. government tracked required an extra 80 million minutes to reach their destinations. The majority of flights – 54 percent – arrived ahead of schedule in 2014. (The 80 million minutes figure cited earlier is a net number. It consists of about 115 million minutes of delays minus 35 million minutes saved from early arrivals.)”
as well as to reproduce the graphic therein (shown below).
Figure: Reproduced from MDSR p. 291
We need to load packages and set up a connection to the server again.
Code:
library(dbplyr) ## Note the 'b'; this is not dplyr
library(mdsr) ## Package for our book
library(RMySQL) ## Standard R/SQL interface package
db <- dbConnect_scidb("airlines")
For convenience, here is the list of basic verbs again:
Image Source: Baumer et al. Modern Data Science with R.
Remember: Verbs lower in the list must appear after verbs higher in the list when constructing queries.
Here’s the dplyr-to-SQL translation summary again:
Image Source: Baumer et al. Modern Data Science with R
And, it bears repeating, in all caps this time:
IMPORTANT: ALWAYS LIMIT YOUR QUERIES, LEST YOU TRY TO FETCH HUNDREDS OF MILLIONS OF RECORDS AND BREAK EVERYTHING FOR EVERYONE!
One last reminder: to designate a code chunk “SQL”, use {sql connection=db} in the chunk options (where db is whatever you named your connection in a previous R code chunk) in place of the r that is usually there.
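For example, a minimal SQL chunk using the connection named db from above might look like this (the query itself is just a placeholder):

````markdown
```{sql connection=db}
SELECT * FROM flights LIMIT 0, 10;
```
````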
Exercise: What columns will we need from the flights data (use DESCRIBE to see them)? For columns that represent summary statistics, how can these be computed (write down the computation steps in sentences, not in code yet)? Working backwards like this can be a good way to approach a wrangling+visualization problem.

Sample solution:
# The quote references several summary statistics, which are
# aggregated over the entire dataset:
# * there were 6 million flights altogether
# * the total net(?) delay was 80 million minutes
# * this net delay involved 115 million minutes from late flights
# * and -35 million minutes from early flights
# * 54% of flights arrived early
#
# Since these statistics describe the dataset as a whole, it seems
# that the relevant summary table will have only one row, with
# the following columns:
# * Number of flights
# * Total number of minutes of delay from late flights
# * Total number of minutes saved from early flights
# * Total number (or %) of flights that were early
#
# Since the cases in `flights` are flights, the total number
# of flights can be obtained by simply summing the number of records.
# We will need to create a boolean (binary) variable that
# encodes for each flight whether arr_delay > 0.
# We will need to compute the proportion of the time that binary
# variable is "true".
# We will need to sum the arr_delay variable for flights where
# arr_delay > 0, and separately for flights where arr_delay < 0,
# as well as summing it overall.
#
# For the graph, we could potentially create either a summarized
# data table in which cases are airlines, or a flight-level
# data table in which the summarizing happens when the bar graph
# is created. Since this is a very large dataset, we will want
# to do the summarizing on the SQL side before we bring data into
# memory to be visualized, so I will assume that form here.
#
# * Each row should represent an airline.
# * We will need a column encoding the airline's name
# * We will need a column encoding the percentage of flights that
# were delayed 15-119 minutes, and a column encoding the percentage
# of flights delayed 120 minutes or more.
# * For ggplot purposes, we probably want to keep all of the percentages
# in a single column, with another column indicating whether the
# percentage corresponds to a "short" or a "long" delay (that is, we
# want to "gather" the data into a "tidy" format)
Exercise: Using the flights dataset, write an SQL query to calculate the necessary summary information. A tip: you can use commands of the form if(<condition>, <value if true>, <value if false>) in SQL to create a variable that has one value if a condition is met and another value if it isn’t (the angle brackets indicate placeholders, and shouldn’t actually be typed). Note: The result should just be one row, but include a LIMIT clause anyway, just in case the result isn’t what you intend. Note: I found that doing this on all the flights in 2014 takes a really long time, since new variables are being computed for each one. To save time, each member of your group can compute the quantities of interest for three or four days spread throughout 2014 (you might use your birthdays, though if they’re all in the same season this will skew the results), multiplying count variables by 365 to extrapolate to the year. The group can average together their results.

Sample solution:
SELECT
365 * sum(1) as num_flights, -- could also do count(*)
sum(if(arr_delay < 0, 1, 0)) / sum(1) AS early_pct,
365 * sum(if(arr_delay > 0, arr_delay, 0)) / 1000000 AS min_late, -- in millions
365 * sum(if(arr_delay < 0, arr_delay, 0)) / 1000000 AS min_early,
365 * sum(arr_delay) / 1000000 AS net_delay
FROM flights AS f
WHERE year = 2014 AND month = 1 AND day = 4
LIMIT 0,6
| num_flights | early_pct | min_late | min_early | net_delay |
|---|---|---|---|---|
| 5308560 | 0.2771 | 203.9667 | -15.169 | 188.7977 |
My birthday is in the winter so there are more delays, but if you average, you should notice that the numbers still don’t quite match what’s in the quote, and not just because of the sampling error involved.
The total minutes early come close, but the total minutes late is way under what FiveThirtyEight reports. It turns out, when you read FiveThirtyEight’s methodology, that cancelled flights have arr_delay = 0 in the data, and so these aren’t contributing to the statistics we’ve computed; but these flights obviously hold travelers up. FiveThirtyEight did some modeling to estimate an arr_delay number for cancelled flights; hence the discrepancy. We won’t try to reproduce what they did; instead, as an approximation, we will consider cancelled flights to be delayed by 4.5 hours (following another quote in the article suggesting a “quick and dirty” estimate of 4-5 hours for each cancelled flight).
Sample Solution:
SELECT
365 * count(*) as num_flights, -- could also do sum(1)
sum(if(arr_delay < 0, 1, 0)) / count(*) AS early_pct,
365 * sum(if(arr_delay > 0, arr_delay, if(cancelled = 1, 270, 0))) / 1000000 AS min_late, -- in millions
365 * sum(if(arr_delay < 0, arr_delay, 0)) / 1000000 AS min_early,
365 * sum(arr_delay) / 1000000 AS net_delay
FROM flights AS f
WHERE year = 2014 AND month = 1 AND day = 4
LIMIT 0,6
| num_flights | early_pct | min_late | min_early | net_delay |
|---|---|---|---|---|
| 5308560 | 0.2771 | 324.8876 | -15.169 | 188.7977 |
Now let’s create the dataset for the graph. We’re going to need to pull the results into R, but let’s first write the query in SQL to confirm that we get what we want.
Exercise: Write an SQL query that produces the summary table needed for the graph, with one row per airline. You will want the airline’s full name, which you can get by joining in the name field from the carriers data table. Note that the graph is sorted in descending order by percentage of short delays.

Sample Solution:
SELECT
f.carrier,
c.name,
sum(if(arr_delay BETWEEN 15 AND 119, 1, 0)) / count(*) as short_delay_pct,
sum(if(arr_delay >= 120 OR cancelled = 1 OR diverted = 1, 1, 0)) / count(*) as long_delay_pct
FROM flights as f
LEFT JOIN carriers AS c ON f.carrier = c.carrier
WHERE year = 2014 AND month = 1 AND day = 4
GROUP BY f.carrier
ORDER BY short_delay_pct DESC
LIMIT 0,50
| carrier | name | short_delay_pct | long_delay_pct |
|---|---|---|---|
| FL | AirTran Airways Corporation | 0.5418 | 0.0605 |
| WN | Southwest Airlines Co. | 0.5392 | 0.2330 |
| F9 | Frontier Airlines Inc. | 0.4128 | 0.5174 |
| UA | United Air Lines Inc. | 0.4035 | 0.2326 |
| OO | SkyWest Airlines Inc. | 0.3306 | 0.2141 |
| US | US Airways Inc. | 0.3292 | 0.0396 |
| MQ | Envoy Air | 0.3208 | 0.1615 |
| VX | Virgin America | 0.3195 | 0.0888 |
| DL | Delta Air Lines Inc. | 0.2931 | 0.0648 |
| EV | ExpressJet Airlines Inc. | 0.2808 | 0.2601 |
We’re now done with the SQL part of the process!
Now that we have a small dataset, we can turn it into an R data frame and do our finishing wrangling touches in dplyr and our visualization in ggplot2.
Exercise: Store the text of your SQL query in an R string variable called query.

Sample Solution:
query <-
"SELECT
f.carrier,
c.name,
sum(if(arr_delay BETWEEN 15 AND 119, 1, 0)) / sum(1) as short_delay_pct,
sum(if(arr_delay >= 120 OR cancelled = 1 OR diverted = 1, 1, 0)) / sum(1) as long_delay_pct
FROM flights as f
LEFT JOIN carriers AS c ON f.carrier = c.carrier
WHERE year = 2014 AND month = 1 AND day = 4
GROUP BY f.carrier
ORDER BY short_delay_pct DESC
LIMIT 0,50"
Exercise: Using your query string, create an R data frame that contains the relevant information with db %>% dbGetQuery(query) %>% collect(). The use of collect() here brings the actual data from the table into memory, not just a pointer to a “view” of the data on the remote server, so don’t do this until you know that your query produces the small result set you want.

Sample Solution:
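Concretely, the fetch step looks something like this (a sketch using the db connection and query string defined earlier; it needs a live connection to the airlines server to run):

```r
## Run the query on the server, then pull the (small) result
## set into local memory as an R data frame
delays <- db %>%
  dbGetQuery(query) %>%
  collect()
delays
```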
## carrier name short_delay_pct long_delay_pct
## 1 FL AirTran Airways Corporation 0.5418 0.0605
## 2 WN Southwest Airlines Co. 0.5392 0.2330
## 3 F9 Frontier Airlines Inc. 0.4128 0.5174
## 4 UA United Air Lines Inc. 0.4035 0.2326
## 5 OO SkyWest Airlines Inc. 0.3306 0.2141
## 6 US US Airways Inc. 0.3292 0.0396
## 7 MQ Envoy Air 0.3208 0.1615
## 8 VX Virgin America 0.3195 0.0888
## 9 DL Delta Air Lines Inc. 0.2931 0.0648
## 10 EV ExpressJet Airlines Inc. 0.2808 0.2601
## 11 AA American Airlines Inc. 0.2647 0.0487
## 12 B6 JetBlue Airways 0.2535 0.6416
## 13 AS Alaska Airlines Inc. 0.1480 0.0072
## 14 HA Hawaiian Airlines Inc. 0.0860 0.0054
Getting the airline names to display as they are in the graph will require some string manipulation. For one thing, we want to strip away the formal company identifiers like Co. and Inc. Moreover, we don’t really need the Airlines and Airways bits.

We can use the gsub() function to find these substrings and replace them. The syntax for this uses what are called “regular expressions”: essentially a pattern-matching language that lets us define a variety of more or less general “wildcards”. Here are the gsub() commands we want.
delays <- delays %>%
## Replaces Airlines, Airways or Air Lines with nothing
mutate(name = gsub("Air(lines|ways| Lines)", "", name)) %>%
## Replaces Inc., Co., or Corporation with nothing
## Note the need for \\ before a period; the period
## has a special meaning in regular expressions
mutate(
name = gsub("(Inc\\.|Co\\.|Corporation)", "", name)) %>%
## Removes any parentheticals. Here we see the "special"
## meaning of the period: it stands for "match any single
## character" (except a line-break).
## It is followed by a *, which acts as a
## 'modifier' on the previous string element; in this case
## the period: the modifier says "match zero or more
## instances of the preceding thing". So .* will match
## basically any string that doesn't include a line
## break.
mutate(name = gsub("\\(.*\\)", "", name)) %>%
## The $ is a special character that represents
## the end of a line; and the + is like the *,
## except it will only match if there is at least
## one instance of the preceding character.
## So the pattern below will match one or more
## spaces that occur at the end of a string.
## We replace that whitespace with nothing.
mutate(name = gsub(" *$", "", name))
delays
## carrier name short_delay_pct long_delay_pct
## 1 FL AirTran 0.5418 0.0605
## 2 WN Southwest 0.5392 0.2330
## 3 F9 Frontier 0.4128 0.5174
## 4 UA United 0.4035 0.2326
## 5 OO SkyWest 0.3306 0.2141
## 6 US US 0.3292 0.0396
## 7 MQ Envoy Air 0.3208 0.1615
## 8 VX Virgin America 0.3195 0.0888
## 9 DL Delta 0.2931 0.0648
## 10 EV ExpressJet 0.2808 0.2601
## 11 AA American 0.2647 0.0487
## 12 B6 JetBlue 0.2535 0.6416
## 13 AS Alaska 0.1480 0.0072
## 14 HA Hawaiian 0.0860 0.0054
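As a quick sanity check on those regular expressions, we can trace a single raw name through the same substitutions (a standalone sketch, separate from the pipeline above):

```r
nm <- "Southwest Airlines Co."
nm <- gsub("Air(lines|ways| Lines)", "", nm)      # "Southwest  Co."
nm <- gsub("(Inc\\.|Co\\.|Corporation)", "", nm)  # "Southwest  "
nm <- gsub("\\(.*\\)", "", nm)                    # no parenthetical here, so unchanged
nm <- gsub(" *$", "", nm)                         # trailing spaces removed
nm
## [1] "Southwest"
```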
You’ll notice that you get more airlines than appear in the graph. We could do some consolidation, but we’ll set this aside for now.
delays %>%
ggplot(aes(x = name)) +
geom_bar(
aes(y = short_delay_pct + long_delay_pct),
fill = "gold",
stat = "identity") +
geom_bar(
aes(y = long_delay_pct),
fill = "red",
stat = "identity")
gather()ing the percentages

For visualizing this data, we’ll want to create a single pct column representing percentages of both delay categories, and a delay_type column indicating whether the percentage refers to short or long delays. This is a job for gather().
Remember that gather() has a key= argument that gives the name of the new grouping variable, a value= argument that gives the name of the new merged quantitative variable, and unnamed arguments listing the column headings in the “wide” data that we are merging.
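As a tiny illustration of these arguments, here is a made-up two-airline “wide” table gathered into long format (gather() lives in the tidyr package):

```r
library(tidyr)  # provides gather()

## A made-up "wide" table: one row per airline
toy <- data.frame(
  name  = c("AirA", "AirB"),
  short = c(0.50, 0.40),
  long  = c(0.10, 0.20)
)

## key = name of the new grouping column, value = name of the
## new merged quantitative column; `short` and `long` are the
## wide columns being merged
toy_long <- gather(toy, key = length, value = delay_pct, short, long)
toy_long  # 4 rows: one per (airline, delay type) pair
```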
Exercise: Use gather() to convert the wide data to the “tidy” format described above.

Sample Solution:
library(tidyr)
delay_long_format <- delays %>%
rename(short = short_delay_pct, long = long_delay_pct) %>%
mutate(name = reorder(name, short, max)) %>%
gather(key = length, value = delay_pct, short, long) %>%
arrange(name)
delay_long_format
## carrier name length delay_pct
## 1 HA Hawaiian short 0.0860
## 2 HA Hawaiian long 0.0054
## 3 AS Alaska short 0.1480
## 4 AS Alaska long 0.0072
## 5 B6 JetBlue short 0.2535
## 6 B6 JetBlue long 0.6416
## 7 AA American short 0.2647
## 8 AA American long 0.0487
## 9 EV ExpressJet short 0.2808
## 10 EV ExpressJet long 0.2601
## 11 DL Delta short 0.2931
## 12 DL Delta long 0.0648
## 13 VX Virgin America short 0.3195
## 14 VX Virgin America long 0.0888
## 15 MQ Envoy Air short 0.3208
## 16 MQ Envoy Air long 0.1615
## 17 US US short 0.3292
## 18 US US long 0.0396
## 19 OO SkyWest short 0.3306
## 20 OO SkyWest long 0.2141
## 21 UA United short 0.4035
## 22 UA United long 0.2326
## 23 F9 Frontier short 0.4128
## 24 F9 Frontier long 0.5174
## 25 WN Southwest short 0.5392
## 26 WN Southwest long 0.2330
## 27 FL AirTran short 0.5418
## 28 FL AirTran long 0.0605
Most of the work is done now; all that remains is to actually produce the bar graph. Some tips here: don’t forget to use stat="identity" with geom_bar(), since we have precomputed the percentages.
You may need to reorder the label variable so the airlines appear in descending order of short delays. You can do this with reorder(<variable to sort>, <variable to sort on>, max). This iteratively applies the max function to the values of the <variable to sort on>, then finds the matching value of the <variable to sort> and puts it first in the order. Then it does it again with the values that remain. Etc.
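A minimal illustration of how reorder() sets factor levels (using base R only, with made-up labels and values):

```r
x <- c("a", "a", "b")   # labels: "a" appears twice, "b" once
v <- c(1, 5, 3)         # values to sort on: max for "a" is 5, for "b" is 3
f <- reorder(x, v, max) # levels ordered by ascending max
levels(f)               # "b" comes before "a", since 3 < 5
```

With coord_flip(), ascending levels are drawn bottom-to-top, which is why this ordering produces the descending-from-the-top look of the FiveThirtyEight graph.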
Construct the barplot with the bars oriented vertically first, then use + coord_flip() in your ggplot chain to turn it sideways.
Use scale_y_continuous() to set the limits on the “y” axis. Load ggthemes and use theme_fivethirtyeight() to get closer to the aesthetic they use.
Post your graph to #lab13, along with the honor pledge, in Slack by Tuesday.