Integrating SQL with R and ggplot2

SQL vs dplyr

Now that you (are on your way to) know(ing) two different data wrangling languages (dplyr and SQL), it’s worth spending a minute thinking about their relative strengths and weaknesses. Here are a few strengths of each one:

dplyr

  • Runs on the same computer you’re working on
  • Has a “functional programming” style: Each wrangling verb has a clear input and output (whether this is a strength depends on your attitude toward functional programming, I suppose)
  • Is integrated into the larger tidyverse
  • Is more flexible: can combine standard verbs with arbitrary R code, incorporate code in custom functions, etc

SQL

  • Is more efficient for large datasets
  • Code reads (sort of) like English sentences (this is also a matter of taste; the code is probably ‘easier on the eyes’ than R code, but I find it harder to pull apart than the functional style of R; but… your mileage may vary)
  • Is more widely used in industry outside data science specifically (though R and the tidyverse have gotten far more widely used in the last few years)

There are two big reasons we can’t rely exclusively on SQL, however:

  • it doesn’t support plotting
  • it doesn’t support statistical modeling

So if we want to do more than show some tables, we need to be able to pass the results of our SQL queries back into R, so we can create graphs and (though we’re not focusing on this in this class) models.

Goal of this lab

Use SQL together with ggplot to produce visualizations from large datasets.

In particular, we will try to verify the following claim from the FiveThirtyEight article here:

“In 2014, the 6 million domestic flights the U.S. government tracked required an extra 80 million minutes to reach their destinations. The majority of flights – 54 percent – arrived ahead of schedule in 2014. (The 80 million minutes figure cited earlier is a net number. It consists of about 115 million minutes of delays minus 35 million minutes saved from early arrivals.)”

as well as to reproduce the graphic therein (shown below).