Basic query components in SQL (cont’d)

Goal

Finish going through the SQL equivalents of the basic “five verbs” from dplyr, and practice using them to pull data from large databases stored on a remote server.

Setting up the connection

We need to load packages and set up a connection to the server again.

Code:

library(dbplyr)  ## Note the 'b'; this is not dplyr
library(mdsr)    ## Package for our book
library(RMySQL)  ## Standard R/SQL interface package
db <- dbConnect_scidb("airlines")

For convenience here is the list of basic verbs again:

Image Source: Baumer et al. Modern Data Science with R.

Image Source: Baumer et al. Modern Data Science with R.

Remember: Verbs lower in the list must appear after verbs higher in the list when constructing queries.

Here’s the dplyr to SQL translation summary again:

Image Source: Baumer et al. Modern Data Science with R

Image Source: Baumer et al. Modern Data Science with R

And, it bears repeating, in all caps this time:

IMPORTANT: ALWAYS LIMIT YOUR QUERIES, LEST YOU TRY TO FETCH HUNDREDS OF MILLIONS OF RECORDS AND BREAK EVERYTHING FOR EVERYONE

One last reminder: to designate a code chunk “SQL”, use {sql connection=db} in the chunk options (where db is whatever you named your connection in a previous R code chunk) in place of the r that is usually there.

SHOW TABLES and DESCRIBE

We used dbListTables() to list the tables in a database using R code; the SQL equivalent of this is SHOW TABLES.

Code:

SHOW TABLES;
Tables_in_airlines
airports
carriers
flights
planes
summary
weather

To see what variables (“fields” in database lingo) are in a particular table, we can use DESCRIBE (similar to str() in R).

DESCRIBE flights;
Displaying records 1 - 10
Field Type Null Key Default Extra
year smallint(4) YES MUL NA
month smallint(2) YES NA
day smallint(2) YES NA
dep_time smallint(4) YES NA
sched_dep_time smallint(4) YES NA
dep_delay smallint(4) YES NA
arr_time smallint(4) YES NA
sched_arr_time smallint(4) YES NA
arr_delay smallint(4) YES NA
carrier varchar(2) NO MUL

Aggregation (equivalent of summarize())

SQL doesn’t have a verb equivalent to summarize(); just like with mutate() this gets handled by SELECT as well. We can ask for aggregated variables (which in dplyr is the job of summarize()) just as we can ask for elementwise transformations (the job of mutate()), using exactly the same syntax. For example, to calculate the average departure delay for all flights on June 29th, 2012, we can do

Code:

SELECT avg(dep_delay) AS avg_delay
FROM flights
WHERE year = 2012 AND month = 6 AND day = 29;
avg_delay
15.8233

Note that we don’t need a LIMIT here, since we’re aggregating the dataset to a single number.

(If you forgot that the SQL function for the average is avg() instead of mean() you can do

Code:

translate_sql(mean(dep_delay, na.rm = TRUE))
## <SQL> avg("dep_delay") OVER ()

though it turns out we don’t need the quotes, and we can leave out the OVER() clause since it’s empty anyway.

There are two ways to get the number of records being aggregated over (for which we would use n() in dplyr): either sum(1) or count(*):

Code:

SELECT 
  sum(1) AS N1,
  count(*) AS N2,  -- redundant; just to illustrate
  avg(dep_delay) as avg_delay
FROM flights
WHERE year = 2012 AND month = 6 AND day = 29;
1 records
N1 N2 avg_delay
18413 18413 15.8233

GROUP BY

Conveniently, the SQL verb equivalent to dplyr’s group_by() is also called GROUP BY. Except now it goes toward the end of the query, after the aggregations we want are specified, and we need to explicitly indicate that we want the grouping variable included in the output (this happened automatically in dplyr). To compute average departure delay on a specific day by carrier, and sort carriers in ascending order of mean delay:

SELECT 
  carrier,
  count(*) AS num_flights,
  avg(dep_delay) as avg_delay
FROM flights
WHERE year = 2012 AND month = 6 AND day = 29
GROUP BY carrier
ORDER BY avg_delay ASC;
Displaying records 1 - 10
carrier num_flights avg_delay
HA 226 1.5708
AS 457 2.1072
FL 663 3.7496
US 1192 3.9010
DL 2224 8.0692
F9 247 9.4372
YV 414 11.3237
OO 1846 12.1056
AA 1480 14.9311
EV 2327 17.1104

Notice that, unlike WHERE, the ORDER BY component here is sorting the output based on what shows up in the results, not what was in the original data.

  1. Suppose we want to restrict our results to bigger airlines; namely those with over 1000 flights that day. Modify the above query to achieve this. (Hint: the modification has nothing to do with the grouping or summarization component; you’ll need to think all the way back to Wednesday’s lab) Be cognizant of the “canonical order” of the verbs!

JOIN (and its variants)

The output above is useful if we know what airlines the two digit IDs correspond to. Some of them are easily recognized; others not so much. It would make our results more useful if we returned the actual full names of the carriers. Since this is information about a carrier not about a flight, it is stored in a different table; namely, carriers.

Code:

SHOW TABLES;
Tables_in_airlines
airports
carriers
flights
planes
summary
weather
DESCRIBE carriers;
2 records
Field Type Null Key Default Extra
carrier varchar(7) NO PRI
name varchar(255) NO

Looks like name is the variable/field we want.

The four join types we learned in dplyr have equivalents in SQL, which are summarized below. The equivalent of the by= argument that we use in dplyr to specify the column used to align the tables is the ON keyword.

dplyr SQL
inner_join() JOIN
left_join() LEFT JOIN
right_join() RIGHT JOIN
full_join() CROSS JOIN
by= ON

To attach the carrier’s full name to our result set from above, we could probably use any of these, but I think a left join makes the most sense (since we want to make sure to include each airline in our result set, even if for some reason its full name is missing from the carrier list; which it shouldn’t be in this data, but still).

Code:

SELECT 
  f.carrier,
  c.name,
  count(*) AS num_flights,
  avg(dep_delay) as avg_delay
FROM flights AS f
LEFT JOIN carriers AS c ON f.carrier = c.carrier
WHERE year = 2012 AND month = 6 AND day = 29
GROUP BY carrier
HAVING num_flights > 1000
ORDER BY avg_delay ASC;
8 records
carrier name num_flights avg_delay
US US Airways Inc. 1192 3.9010
DL Delta Air Lines Inc. 2224 8.0692
OO SkyWest Airlines Inc. 1846 12.1056
AA American Airlines Inc. 1480 14.9311
EV ExpressJet Airlines Inc. 2327 17.1104
MQ Envoy Air 1419 19.1832
WN Southwest Airlines Co. 3434 24.2266
UA United Air Lines Inc. 1644 25.6448

Merging multiple queries: UNION (cf. bind_rows())

If we want to take two queries and merge their output into a single result set, we can simply concatenate the queries with the keyword UNION. For example, if there are two particular days of interest for which we want to compute some things, we can write queries for each one and merge them. Suppose for whatever reason we are interested in flights either on June 29th, 2012 or on October 13, 2014.

Code:

(SELECT 
  f.carrier,
  c.name,
  year, month, day,
  count(*) AS num_flights
FROM flights AS f
LEFT JOIN carriers AS c ON f.carrier = c.carrier
WHERE year = 2012 AND month = 6 AND day = 29
GROUP BY carrier
HAVING num_flights > 1000)
UNION
(SELECT 
  f.carrier,
  c.name,
  year, month, day,
  count(*) AS num_flights
FROM flights AS f
LEFT JOIN carriers AS c ON f.carrier = c.carrier
WHERE year = 2014 AND month = 10 AND day = 13
GROUP BY carrier
HAVING num_flights > 1000)
ORDER BY num_flights DESC;
Displaying records 1 - 10
carrier name year month day num_flights
WN Southwest Airlines Co. 2012 6 29 3434
WN Southwest Airlines Co. 2014 10 13 3390
DL Delta Air Lines Inc. 2014 10 13 2425
EV ExpressJet Airlines Inc. 2012 6 29 2327
DL Delta Air Lines Inc. 2012 6 29 2224
EV ExpressJet Airlines Inc. 2014 10 13 1962
OO SkyWest Airlines Inc. 2012 6 29 1846
OO SkyWest Airlines Inc. 2014 10 13 1746
UA United Air Lines Inc. 2012 6 29 1644
AA American Airlines Inc. 2014 10 13 1497

Often, perhaps most of the time, there will be a more concise way to write a query like this, by, for example, writing conjunctions or disjunctions of WHERE statements and/or adding additional variables to the GROUP BY clause. For example, the following is equvalent.

SELECT 
  f.carrier,
  c.name,
  year, month, day,
  count(*) AS num_flights
FROM flights AS f
LEFT JOIN carriers AS c ON f.carrier = c.carrier
WHERE (year = 2012 AND month = 6 AND day = 29) OR
      (year = 2014 AND month = 10 AND day = 13)
GROUP BY carrier, year, month, day
HAVING num_flights > 1000
ORDER BY num_flights DESC;
Displaying records 1 - 10
carrier name year month day num_flights
WN Southwest Airlines Co. 2012 6 29 3434
WN Southwest Airlines Co. 2014 10 13 3390
DL Delta Air Lines Inc. 2014 10 13 2425
EV ExpressJet Airlines Inc. 2012 6 29 2327
DL Delta Air Lines Inc. 2012 6 29 2224
EV ExpressJet Airlines Inc. 2014 10 13 1962
OO SkyWest Airlines Inc. 2012 6 29 1846
OO SkyWest Airlines Inc. 2014 10 13 1746
UA United Air Lines Inc. 2012 6 29 1644
AA American Airlines Inc. 2014 10 13 1497

Exercises

The following exercises use tables from the imdb database instead of the airlines database, so you’ll need to open a new connection.

Relevant tables are:

  • name: records are people (actors, etc.)
  • title: records are works (movies, etc.)
  • cast_info: records are roles in works, indexed by person
  • char_info: records are roles in works, indexed by character
  1. Find a movie of your choice in the title table. You can use WHERE <field> LIKE '%<sub string>%' to get entries that match part of a string

  2. Find Viola Davis’s person_id in the name table.

  3. Get a list of Viola Davis’s roles, using the cast_info table, showing her name in the output (requires a join)

  4. Add the character names to the previous output, by joining with char_info.

  5. Find Viola Davis’s full filmography, in chronological order. Include each movie’s title, production_year, and the name of the character that she played.

Getting credit

DM me your query from exercise 6 (due Wednesday 4/18 by class time).