Now that you (are on your way to) know(ing) two different data wrangling languages (R and SQL), it’s worth spending a minute thinking about their relative strengths and weaknesses. Here are a few strengths of each one:
There are two big reasons we can’t rely exclusively on SQL, however:
So if we want to do more than show some tables, we need to be able to pass the results of our SQL queries back into R, so we can create graphs and (though we’re not focusing on this in this class) models.
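As a minimal sketch of that hand-off, the example below lets SQL do the aggregation and passes only the small summary table to ggplot. It assumes the `DBI`, `RSQLite`, and `ggplot2` packages are installed, and it uses an in-memory SQLite database as a stand-in for whatever server the class actually connects to. The table name `flights`, the columns `carrier` and `arr_delay`, and all of the rows are made up for illustration.

```r
library(DBI)
library(ggplot2)

# Assumption: an in-memory SQLite database stands in for the real database server.
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Toy, made-up rows purely for illustration (not real flight data).
toy_flights <- data.frame(
  carrier   = c("AA", "AA", "DL", "DL", "UA", "UA"),
  arr_delay = c(10, -5, 20, 0, -15, 30)
)
dbWriteTable(con, "flights", toy_flights)

# Let SQL do the aggregation, so only a small summary table comes back to R.
avg_delay <- dbGetQuery(con, "
  SELECT carrier, AVG(arr_delay) AS avg_delay
  FROM flights
  GROUP BY carrier
  ORDER BY carrier
")
dbDisconnect(con)

# The query result is an ordinary data frame, so ggplot can use it directly.
p <- ggplot(avg_delay, aes(x = carrier, y = avg_delay)) +
  geom_col() +
  labs(x = "Carrier", y = "Average arrival delay (minutes)")
print(p)
```

The point of the design is that the database returns one row per carrier rather than one row per flight, which is what makes this approach workable when the underlying table is too large to load into R.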
Use SQL together with ggplot to produce visualizations from large datasets.
In particular, we will try to verify the following claim from the FiveThirtyEight article:
“In 2014, the 6 million domestic flights the U.S. government tracked required an extra 80 million minutes to reach their destinations. The majority of flights – 54 percent – arrived ahead of schedule in 2014. (The 80 million minutes figure cited earlier is a net number. It consists of about 115 million minutes of delays minus 35 million minutes saved from early arrivals.)”
as well as to reproduce the graphic therein (shown below).
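The quantities in that claim (total flights, minutes of delay, minutes saved by early arrivals, the net figure, and the share of early arrivals) can all be computed in a single SQL query. The sketch below shows one possible shape for such a query, run against a tiny made-up table in an in-memory SQLite database. The table name `flights` and the columns `year` and `arr_delay` are assumptions, and the rows are toy values, so the output says nothing about the real 2014 numbers.

```r
library(DBI)

# Assumption: in-memory SQLite stands in for the real flights database.
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Toy, made-up rows: four flights in 2014 and one in 2013 (to exercise the filter).
toy_flights <- data.frame(
  year      = c(2014, 2014, 2014, 2014, 2013),
  arr_delay = c(30, -10, -5, 15, 100)
)
dbWriteTable(con, "flights", toy_flights)

summary_2014 <- dbGetQuery(con, "
  SELECT
    COUNT(*) AS n_flights,
    -- minutes lost to late arrivals (positive delays only)
    SUM(CASE WHEN arr_delay > 0 THEN arr_delay ELSE 0 END) AS minutes_late,
    -- minutes saved by early arrivals (negative delays, sign flipped)
    SUM(CASE WHEN arr_delay < 0 THEN -arr_delay ELSE 0 END) AS minutes_early,
    -- net figure: late minutes minus early minutes
    SUM(arr_delay) AS net_minutes,
    -- fraction of flights that arrived ahead of schedule
    AVG(CASE WHEN arr_delay < 0 THEN 1.0 ELSE 0.0 END) AS share_early
  FROM flights
  WHERE year = 2014
")
dbDisconnect(con)

print(summary_2014)
```

On these toy rows the query reports 45 late minutes and 15 early minutes, for a net of 30, which mirrors the structure of the article's arithmetic (about 115 million minus 35 million gives the net 80 million).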