ggplot2
dplyr
Now that you (are on your way to) know(ing) two different data wrangling languages (dplyr
and SQL
), it’s worth spending a minute thinking about their relative strengths and weaknesses. Here are a few strengths of each one:
dplyr
tidyverse
SQL
There are two big reasons we can’t rely exclusively on SQL, however:
So if we want to do more than show some tables, we need to be able to pass the results of our SQL queries back into R, so we can create graphs and (though we’re not focusing on this in this class) models.
Use SQL together with ggplot
to produce visualizations from large datasets.
In particular, we will try to verify the following claim from the FiveThirtyEight article here:
“In 2014, the 6 million domestic flights the U.S. government tracked required an extra 80 million minutes to reach their destinations. The majority of flights – 54 percent – arrived ahead of schedule in 2014. (The 80 million minutes figure cited earlier is a net number. It consists of about 115 million minutes of delays minus 35 million minutes saved from early arrivals.)”
as well as to reproduce the graphic therein (shown below).
We need to load packages and set up a connection to the server again.
Code:
library(dbplyr) ## Note the 'b'; this is not dplyr
library(mdsr) ## Package for our book
library(RMySQL) ## Standard R/SQL interface package
db <- dbConnect_scidb("airlines")
For convenience here is the list of basic verbs again:
Remember: Verbs lower in the list must appear after verbs higher in the list when constructing queries.
Here’s the dplyr
to SQL
translation summary again:
And, it bears repeating, in all caps:
IMPORTANT: ALWAYS LIMIT
YOUR QUERIES, LEST YOU TRY TO FETCH HUNDREDS OF MILLIONS OF RECORDS AND BREAK EVERYTHING FOR EVERYONE!
One last reminder: to designate a code chunk “SQL”, use {sql connection=db}
in the chunk options (where db
is whatever you named your connection in a previous R code chunk) in place of the r
that is usually there.
flights
data, etc.? Remember you can use use DESCRIBE
in an SQL chunk to see an overview of the columns. For columns that represent summary statistics, how can these be computed? Write down the computation steps in sentences; not in code yet. Working backwards like this can be a good way to approach a wrangling+visualization problem.The quote references several summary statistics, which are aggregated over the entire dataset:
Since these statistics describe the dataset as a whole, it seems that the relevant summary table will have only one row, with the following columns:
Since the cases in flights
are flights, the total number of flights can be obtained by counting records. We will need to
arr_delay > 0
arr_delay > 0
, and separately for flights where arr_delay < 0
, as well as summing it overall.For the graph, we could potentially create either a summarized data table in which cases are airlines, or a flight-level data table in which the summarizing happens when the bar graph is created. Since this is a very large dataset, we will want to do as much summarizing as we can on the SQL side before we bring data into memory as an R object to be visualized, so the first representation is likely preferable. The summary table should be structured as follows:
flights
dataset, and write an SQL query to calculate the necessary summary information. To save computing strain, just use the flights that took place on your birthday in 2014, rather than all of 2014, and multiply each variable (that isn’t a percentage) by 365 to extrapolate to the whole year (there are of course seasonal differences in flight delays, so this is not a representative sample, so we shouldn’t expect extremely similar results to the FiveThirtyEight article). A tip: you can use commands of the form if(<condition>, <value if true>, <value if false>)
in SQL to create a variable that has one value if a condition is met and another value if it isn’t (the angle brackets indicate placeholders, and shouldn’t actually be typed). Note: The result should just be one row, but include a LIMIT
clause anyway, just in case the result isn’t what you intend.SELECT
365 * sum(1) as num_flights, -- could also do count(*)
sum(if(arr_delay < 0, 1, 0)) / sum(1) AS early_pct,
365 * sum(if(arr_delay > 0, arr_delay, 0)) / 1000000 AS min_late, -- in millions
365 * sum(if(arr_delay < 0, arr_delay, 0)) / 1000000 AS min_early,
365 * sum(arr_delay) / 1000000 AS net_delay
FROM flights AS f
WHERE year = 2014 AND month = 1 AND day = 4
LIMIT 0,6
num_flights | early_pct | min_late | min_early | net_delay |
---|---|---|---|---|
5308560 | 0.2771 | 203.9667 | -15.169 | 188.7977 |
My birthday is in the winter so there are more delays, but even for less travel-challenged times of year, you should notice that the numbers still don’t quite match what’s in the quote.
The total minutes early come close, but the total minutes late is way under what FiveThirtyEight reports. It turns out, when you read FiveThirtyEight’s methodology, that cancelled flights have arr_delay = 0
in the data, and so these aren’t contributing to the statistics we’ve computed; but these flights obviously hold travelers up.
FiveThirtyEight did some modeling to estimate an arr_delay
number for cancelled flights; hence the discrepancy. We won’t try to reproduce what they did; instead as an approximation, we will consider cancelled flights to be delayed by 4.5 hours (following another quote in the article suggesting a “quick and dirty” estimate of 4-5 hours for each cancelled flight).
if()
statement that adds up the minutes late, tallying 270 for a cancelled flight instead of zero.SELECT
365 * count(*) as num_flights, -- could also do sum(1)
sum(if(arr_delay < 0, 1, 0)) / count(*) AS early_pct,
365 * sum(if(arr_delay > 0, arr_delay, if(cancelled = 1, 270, 0))) / 1000000 AS min_late, -- in millions
365 * sum(if(arr_delay < 0, arr_delay, 0)) / 1000000 AS min_early,
365 * sum(arr_delay) / 1000000 AS net_delay
FROM flights AS f
WHERE year = 2014 AND month = 1 AND day = 4
LIMIT 0,6
num_flights | early_pct | min_late | min_early | net_delay |
---|---|---|---|---|
5308560 | 0.2771 | 324.8876 | -15.169 | 188.7977 |
Now let’s create the dataset for the graph. We’re going to need to pull the results into R, but let’s first write the query in SQL to confirm that we get what we want.
pivot_longer()
step that we’ll need to get the data into the “tidy” format; we’ll do that in R later. Hint: for the labels, you’ll want the name
field from the carriers
data table. Note: that the graph is sorted in descending order by percentage of short delays.carrier | name | short_delay_pct | long_delay_pct |
---|---|---|---|
FL | AirTran Airways Corporation | 0.5418 | 0.0605 |
WN | Southwest Airlines Co. | 0.5392 | 0.2330 |
F9 | Frontier Airlines Inc. | 0.4128 | 0.5174 |
UA | United Air Lines Inc. | 0.4035 | 0.2326 |
OO | SkyWest Airlines Inc. | 0.3306 | 0.2141 |
US | US Airways Inc. | 0.3292 | 0.0396 |
MQ | Envoy Air | 0.3208 | 0.1615 |
VX | Virgin America | 0.3195 | 0.0888 |
DL | Delta Air Lines Inc. | 0.2931 | 0.0648 |
EV | ExpressJet Airlines Inc. | 0.2808 | 0.2601 |
We’re now done with the SQL part of the process!
Now that we have a small dataset, we can turn it into an R data frame and do our finishing wrangling touches in dplyr
and our visualization in ggplot2
.
query
.query
string, create an R data frame that contains the relevant information with db %>% dbGetQuery(query) %>% collect()
. The use of collect()
here brings the actual data from the table into memory; not just a pointer to a “view” of the data on the remote server, so don’t do this until you know that your query produces the small result set you want.## carrier name short_delay_pct long_delay_pct
## 1 FL AirTran Airways Corporation 0.5418 0.0605
## 2 WN Southwest Airlines Co. 0.5392 0.2330
## 3 F9 Frontier Airlines Inc. 0.4128 0.5174
## 4 UA United Air Lines Inc. 0.4035 0.2326
## 5 OO SkyWest Airlines Inc. 0.3306 0.2141
## 6 US US Airways Inc. 0.3292 0.0396
## 7 MQ Envoy Air 0.3208 0.1615
## 8 VX Virgin America 0.3195 0.0888
## 9 DL Delta Air Lines Inc. 0.2931 0.0648
## 10 EV ExpressJet Airlines Inc. 0.2808 0.2601
## 11 AA American Airlines Inc. 0.2647 0.0487
## 12 B6 JetBlue Airways 0.2535 0.6416
## 13 AS Alaska Airlines Inc. 0.1480 0.0072
## 14 HA Hawaiian Airlines Inc. 0.0860 0.0054
Getting the airline names to display as they are in the graph will require some text manipulation. For one thing, we want to strip away the formal company identifiers like Co.
and Inc.
. Moreover, we don’t really need the Airlines
and Airways
bits.
We can use the gsub()
function to find these substrings and replace them. The syntax for this uses what’s called “regular expressions”; essentially a pattern matching language that lets us define a variety of more or less general “wildcards”. Here are the gsub()
commands we want.
delays <- delays %>%
## Replaces Airlines, Airways or Air Lines with nothing
mutate(name = gsub("Air(lines|ways| Lines)", "", name)) %>%
## Replaces Inc., Co., or Corporation with nothing
## Note the need for \\ before a period; the period
## has a special meaning in regular expressions
mutate(name = gsub("(Inc\\.|Co\\.|Corporation)", "", name)) %>%
## Removes any parentheticals. Here we see the "special"
## meaning of the period: it stands for "match any single
## character" (except a line-break).
## It is followed by a *, which acts as a
## 'modifier' on the previous string element; in this case
## the period: the modifier says "match zero or more
## instances of the preceding thing". So .* will match
## basically any string that doesn't include a line
## break.
mutate(name = gsub("\\(.*\\)", "", name)) %>%
## The $ is a special character that represents
## the end of a line; and the + is like the *,
## except it will only match if there is at least
## one instance of the preceding character.
## So the pattern below will match one or more
## spaces that occur at the end of a string.
## We replace that whitespace with nothing.
mutate(name = gsub(" *$", "", name))
delays
## carrier name short_delay_pct long_delay_pct
## 1 FL AirTran 0.5418 0.0605
## 2 WN Southwest 0.5392 0.2330
## 3 F9 Frontier 0.4128 0.5174
## 4 UA United 0.4035 0.2326
## 5 OO SkyWest 0.3306 0.2141
## 6 US US 0.3292 0.0396
## 7 MQ Envoy Air 0.3208 0.1615
## 8 VX Virgin America 0.3195 0.0888
## 9 DL Delta 0.2931 0.0648
## 10 EV ExpressJet 0.2808 0.2601
## 11 AA American 0.2647 0.0487
## 12 B6 JetBlue 0.2535 0.6416
## 13 AS Alaska 0.1480 0.0072
## 14 HA Hawaiian 0.0860 0.0054
You’ll notice that you get more airlines than appear in the graph. We could do some consolidation, but we’ll set this aside for now.
For visualizing this data, we’ll want to create a single pct
column representing percentages of both delay categories, and a delay_type
column indicating whether the percentage refers to short or long delays. This is a job for pivot_longer()
.
Remember that pivot_longer()
has a cols=
argument where we specify the names of columns we’re “stacking”, a names_to
argument where we specify a name for the new grouping variable, and a values_to=
argument where we specify a name for the new merged quantitative variable.
pivot_longer()
to obtain a dataset that has carrier
, name
, delay_type
(short or long), pct_of_type
.## # A tibble: 28 x 4
## carrier name delay_type pct_of_type
## <chr> <fct> <chr> <dbl>
## 1 HA Hawaiian short 0.086
## 2 HA Hawaiian long 0.0054
## 3 AS Alaska short 0.148
## 4 AS Alaska long 0.0072
## 5 B6 JetBlue short 0.254
## 6 B6 JetBlue long 0.642
## 7 AA American short 0.265
## 8 AA American long 0.0487
## 9 EV ExpressJet short 0.281
## 10 EV ExpressJet long 0.260
## # … with 18 more rows
Most of the work is done now; all that remains is to actually produce the bar graph. Some tips here: don’t forget to use stat="identity"
with geom_bar()
, since we have precomputed the percentages.
You may need to reorder the label variable so the airlines appear in descending order of short delays. You can do this with reorder(<variable to be reordered>, <variable to define the ordering>, max)
(don’t type the angle brackets). This iteratively applies the max
function to the values of the <variable to define the ordering>
, and then finds the matching value of the <variable to be reordered>
, and puts it first in the order. Then it does it again with the values that remain. Etc.
+ coord_flip()
in your ggplot
chain to turn it sideways. (Note: To get the percentages to display on the bars the way they are in the FiveThirtyEight graph, we can use geom_text()
. However, it’s a bit awkward to specify the “y” coordinate (which will become the “x” coordinate after flipping the coordinates) for the labels from the long data, since the labels for the longer delays are positioned based on the percentage of short delays. So as a hack you may want to join the long and wide forms of the data so you have the percent of short delays in every row.) Use scale_y_continuous()
to set the limits on the “y” axis. Load the ggthemes
library and use theme_fivethirtyeight()
to get closer to the aesthetic they use.#lab15
in Slack