Learn SQL equivalents of data-wrangling tools beyond the basic “five verbs”; particularly “joins”, and get some practice combining the fundamental ingredients into productive queries – and getting used to some of the “quirks” of SQL’s requirements about query structure – to access data from large databases stored on a remote server in a resource-efficient way.
We need to load packages and set up a connection to the server again. We’ll again do some work with the "airlines"
database.
Code:
library(dbplyr) ## Note the 'b'; this is not dplyr
library(mdsr) ## Package for our book
library(RMySQL) ## Standard R/SQL interface package
db <- dbConnect_scidb("airlines")
For convenience here is the list of basic verbs again:
Remember: Verbs lower in the list must appear after verbs higher in the list when constructing queries.
Here’s the dplyr
to SQL
translation summary again:
And, it bears repeating, in all caps this time:
IMPORTANT: ALWAYS LIMIT
YOUR QUERIES, LEST YOU TRY TO FETCH HUNDREDS OF MILLIONS OF RECORDS AND BREAK EVERYTHING FOR EVERYONE
One last reminder: to designate a code chunk “SQL”, use {sql connection=db}
in the chunk options (where db
is whatever you named your connection in a previous R code chunk) in place of the r
that is usually there.
JOIN
(and its variants)Recall that in the last part of the lab, you constructed a query to show the airlines with at least 1000 flights on June 29th, 2012 in order of the average arrival delay time.
Here is one way you might have written that query:
SELECT
carrier,
count(*) AS num_flights,
avg(dep_delay) as avg_delay
FROM flights
WHERE year = 2012 AND month = 6 AND day = 29
GROUP BY carrier
HAVING num_flights >= 1000
ORDER BY avg_delay ASC;
carrier | num_flights | avg_delay |
---|---|---|
US | 1192 | 3.9010 |
DL | 2224 | 8.0692 |
OO | 1846 | 12.1056 |
AA | 1480 | 14.9311 |
EV | 2327 | 17.1104 |
MQ | 1419 | 19.1832 |
WN | 3434 | 24.2266 |
UA | 1644 | 25.6448 |
The output above is useful if we know what airlines the two digit IDs correspond to. Some of them are easily recognized; others not so much. It would make our results more useful if we returned the actual full names of the carriers. Since this is information about a carrier not about a flight, it is stored in a different table; namely, carriers
.
To see the name of the tables available, we can use SHOW TABLES
; and to see the structure of the carriers
table, we can use DESCRIBE
.
Code:
Tables_in_airlines |
---|
airports |
carriers |
flights |
planes |
Field | Type | Null | Key | Default | Extra |
---|---|---|---|---|---|
carrier | varchar(7) | NO | PRI | ||
name | varchar(255) | NO |
Looks like name
is the variable/field we want.
The four join types we learned in dplyr
have equivalents in SQL
, which are summarized below. The equivalent of the by=
argument that we use in dplyr
to specify the column used to align the tables is the ON
keyword.
dplyr |
SQL |
---|---|
inner_join() |
JOIN |
left_join() |
LEFT JOIN |
right_join() |
RIGHT JOIN |
full_join() |
CROSS JOIN |
by= |
ON |
To attach the carrier’s full name to our result set from above, we could probably use any of these, but I think a left join makes the most sense (since we want to make sure to include each airline in our result set, even if for some reason its full name is missing from the carrier list; which it shouldn’t be in this data, but still).
Here’s what that would look like as an SQL query:
Code:
SELECT
f.carrier,
c.name,
count(*) AS num_flights,
avg(dep_delay) as avg_delay
FROM flights AS f
LEFT JOIN carriers AS c ON f.carrier = c.carrier
WHERE year = 2012 AND month = 6 AND day = 29
GROUP BY carrier
HAVING num_flights > 1000
ORDER BY avg_delay ASC;
carrier | name | num_flights | avg_delay |
---|---|---|---|
US | US Airways Inc. | 1192 | 3.9010 |
DL | Delta Air Lines Inc. | 2224 | 8.0692 |
OO | SkyWest Airlines Inc. | 1846 | 12.1056 |
AA | American Airlines Inc. | 1480 | 14.9311 |
EV | ExpressJet Airlines Inc. | 2327 | 17.1104 |
MQ | Envoy Air | 1419 | 19.1832 |
WN | Southwest Airlines Co. | 3434 | 24.2266 |
UA | United Air Lines Inc. | 1644 | 25.6448 |
UNION
(cf. bind_rows()
)If we want to take two queries and merge their output into a single result set, we can simply concatenate the queries with the keyword UNION
. For example, if there are two particular days of interest for which we want to compute some things, we can write queries for each one and merge them. Suppose for whatever reason we are interested in flights either on June 29th, 2012 or on October 13, 2014.
Code:
(SELECT
f.carrier,
c.name,
year, month, day,
count(*) AS num_flights
FROM flights AS f
LEFT JOIN carriers AS c ON f.carrier = c.carrier
WHERE year = 2012 AND month = 6 AND day = 29
GROUP BY carrier
HAVING num_flights > 1000)
UNION
(SELECT
f.carrier,
c.name,
year, month, day,
count(*) AS num_flights
FROM flights AS f
LEFT JOIN carriers AS c ON f.carrier = c.carrier
WHERE year = 2014 AND month = 10 AND day = 13
GROUP BY carrier
HAVING num_flights > 1000)
ORDER BY num_flights DESC;
carrier | name | year | month | day | num_flights |
---|---|---|---|---|---|
WN | Southwest Airlines Co. | 2012 | 6 | 29 | 3434 |
WN | Southwest Airlines Co. | 2014 | 10 | 13 | 3390 |
DL | Delta Air Lines Inc. | 2014 | 10 | 13 | 2425 |
EV | ExpressJet Airlines Inc. | 2012 | 6 | 29 | 2327 |
DL | Delta Air Lines Inc. | 2012 | 6 | 29 | 2224 |
EV | ExpressJet Airlines Inc. | 2014 | 10 | 13 | 1962 |
OO | SkyWest Airlines Inc. | 2012 | 6 | 29 | 1846 |
OO | SkyWest Airlines Inc. | 2014 | 10 | 13 | 1746 |
UA | United Air Lines Inc. | 2012 | 6 | 29 | 1644 |
AA | American Airlines Inc. | 2014 | 10 | 13 | 1497 |
Often, perhaps most of the time, there will be a more concise way to write a query like this, by, for example, writing conjunctions or disjunctions of WHERE
statements and/or adding additional variables to the GROUP BY
clause. For example, the following is equvalent.
SELECT
f.carrier,
c.name,
year, month, day,
count(*) AS num_flights
FROM flights AS f
LEFT JOIN carriers AS c ON f.carrier = c.carrier
WHERE (year = 2012 AND month = 6 AND day = 29) OR
(year = 2014 AND month = 10 AND day = 13)
GROUP BY carrier, year, month, day
HAVING num_flights > 1000
ORDER BY num_flights DESC;
carrier | name | year | month | day | num_flights |
---|---|---|---|---|---|
WN | Southwest Airlines Co. | 2012 | 6 | 29 | 3434 |
WN | Southwest Airlines Co. | 2014 | 10 | 13 | 3390 |
DL | Delta Air Lines Inc. | 2014 | 10 | 13 | 2425 |
EV | ExpressJet Airlines Inc. | 2012 | 6 | 29 | 2327 |
DL | Delta Air Lines Inc. | 2012 | 6 | 29 | 2224 |
EV | ExpressJet Airlines Inc. | 2014 | 10 | 13 | 1962 |
OO | SkyWest Airlines Inc. | 2012 | 6 | 29 | 1846 |
OO | SkyWest Airlines Inc. | 2014 | 10 | 13 | 1746 |
UA | United Air Lines Inc. | 2012 | 6 | 29 | 1644 |
AA | American Airlines Inc. | 2014 | 10 | 13 | 1497 |
but other times the union we want may not lend itself quite so easily to a concise conjunction or disjunction like this.
The following exercises use tables from the imdb
database instead of the airlines
database, so you’ll need to open a new connection.
Relevant tables are:
name
: records are people (actors, etc.)title
: records are works (movies, etc.)cast_info
: records are roles in works, indexed by personchar_name
: records are roles in works, indexed by characterOpen the connection first:
Tables_in_imdb |
---|
aka_name |
aka_title |
cast_info |
char_name |
comp_cast_type |
company_name |
company_type |
complete_cast |
info_type |
keyword |
title
table. You can use WHERE <field> LIKE '%<sub string>%'
to get entries that match part of a string.Sample solution
-- let's first look at the `title` table and see what the
-- fields are
SELECT * FROM title LIMIT 0,3;
id | title | imdb_index | kind_id | production_year | imdb_id | phonetic_code | episode_of_id | season_nr | episode_nr | series_years | md5sum |
---|---|---|---|---|---|---|---|---|---|---|---|
78460 | Adults Recat to the Simpsons (30th Anniversary) | NA | 7 | 2017 | NA | A3432 | 78406 | NA | NA | NA | 2ae09eed7d576cc2c24774fed5b18168 |
70273 | (2016-05-18) | NA | 7 | 2016 | NA | NA | 68058 | NA | NA | NA | 511dfc14cfff7589d29a95abb30cd66a |
60105 | (2014-04-11) | NA | 7 | 2014 | NA | NA | 59138 | NA | NA | NA | c6cdce7e667e07713e431805c407feed |
-- Now actually get the movie
SELECT * FROM title
WHERE title LIKE "Memento"
AND production_year = 2000
LIMIT 0,5;
id | title | imdb_index | kind_id | production_year | imdb_id | phonetic_code | episode_of_id | season_nr | episode_nr | series_years | md5sum |
---|---|---|---|---|---|---|---|---|---|---|---|
3944746 | Memento | NA | 1 | 2000 | NA | M53 | NA | NA | NA | NA | 4cd6aeb9bfe39114de62452d50c5bd5e |
person_id
in the name
table.Sample solution:
-- first peek at the table to see how names are
-- formatted
SELECT * FROM name LIMIT 0,3;
-- turns out the variable is called id in this table
-- not person_id,
-- and names are in Last, First format
id | name | imdb_index | imdb_id | gender | name_pcode_cf | name_pcode_nf | surname_pcode | md5sum |
---|---|---|---|---|---|---|---|---|
235 | -Alverio, Esteban Rodriguez | NA | NA | m | A4162 | E2315 | A416 | f5c410bff6839b545d04c531f776e8f2 |
921 | Aaberge, Theodor Olai | NA | NA | m | A1623 | T3641 | A162 | eef277cb705ce78a3b41ed22b5d56292 |
1698 | Aarudra | NA | NA | m | A636 | NA | NA | 59b3b4b95e3223cb1471724d42c71654 |
SELECT *
FROM name
WHERE name LIKE '%Davis, Viola%'
LIMIT 0,5;
-- hmm, there are three
-- Looking at IMDB we want Viola Davis (I)
-- Looks like the `I` is in imdb_index
id | name | imdb_index | imdb_id | gender | name_pcode_cf | name_pcode_nf | surname_pcode | md5sum |
---|---|---|---|---|---|---|---|---|
2977373 | Davis, Viola | II | NA | f | D1214 | V4312 | D12 | cece125825cb648dcadaa25ed1f08cf2 |
2977372 | Davis, Viola | I | NA | f | D1214 | V4312 | D12 | 4c0de69942ada8a98c00b22ec7c22ef4 |
4273298 | Davis, Viola | III | NA | NA | D1214 | V4312 | D12 | 15ddfd05d8c7d49215107ece595dc2a3 |
SELECT id,name,imdb_index
FROM name
WHERE name LIKE '%Davis, Viola%'
AND imdb_index = "I"
LIMIT 0,100;
id | name | imdb_index |
---|---|---|
2977372 | Davis, Viola | I |
cast_info
table, showing her name in the output (requires a join)Sample solution:
-- The usual glance at the table to check out
-- variable names and formats
-- Looks like roles are either role_id or person_role_id?
SELECT * FROM cast_info
LIMIT 0,3;
id | person_id | movie_id | person_role_id | note | nr_order | role_id |
---|---|---|---|---|---|---|
1 | 1 | 3432997 | 1 | NA | 31 | 1 |
2 | 2 | 1901690 | 2 | NA | NA | 1 |
3 | 3 | 4027567 | 2 | NA | 25 | 1 |
SELECT
ci.person_id,
ci.movie_id,
ci.role_id,
ci.person_role_id,
n.name
FROM cast_info as ci
LEFT JOIN name as n ON ci.person_id = n.id
WHERE ci.person_id = 2977372
LIMIT 0,5;
-- This gives us movie ids and role ids for all of
-- Viola Davis's roles, but we can't see the movie names
-- or the role names. Further joinage is in order to get
-- this info (that comes next).
person_id | movie_id | role_id | person_role_id | name |
---|---|---|---|---|
2977372 | 3206068 | 2 | 418310 | Davis, Viola |
2977372 | 3206331 | 2 | 637 | Davis, Viola |
2977372 | 3206767 | 2 | 637 | Davis, Viola |
2977372 | 3208095 | 2 | 29580 | Davis, Viola |
2977372 | 3210140 | 2 | 29580 | Davis, Viola |
char_info
.Sample solution:
SELECT * FROM char_name LIMIT 0,10;
-- OK, looks like char_name.id might be is equivalent to
-- cast_info.person_role_id
id | name | imdb_index | imdb_id | name_pcode_nf | surname_pcode | md5sum |
---|---|---|---|---|---|---|
9201 | U.S.S. Soldier | NA | NA | U2436 | S436 | ff2690cc3b3b8486ecc17fc597bc9d47 |
10644 | Count Rood | NA | NA | C5363 | R3 | 10c0099f53636b7a10e36ecd2b12b510 |
75784 | Himself - Chicago Bears | NA | NA | H5241 | B62 | 60ea011c0d7f7d4b12d4af090867b785 |
97621 | Samatan | NA | NA | S535 | NA | 2dc52183f09c00704d3c65be1ce22bc9 |
31691 | Man mistaken for alex | NA | NA | M5232 | A42 | 7c73d4f4faa6f1865fa5dc953fda635d |
22621 | Paulie Gigante | NA | NA | P4253 | G253 | 20152b2de417cdca2387e329f12bbc3d |
25308 | Lil Vito | NA | NA | L413 | V3 | 6330a6bc9354d06814c948ed04453d80 |
81018 | David’s uncle | NA | NA | D1325 | U524 | f373c29e5da0e060733736345649ea4e |
29837 | Septimus | NA | NA | S1352 | NA | 2c0f06f1cbdf55b561680ae231fbb2bd |
88446 | Emin, Nezahat’in Babasi | NA | NA | E5235 | B12 | 456ab69926d681d9c7e870f5f3b82d66 |
-- We'll add the movie names too, since that seems
-- useful to know.
-- We'll also omit the id codes and Viola Davis's name, since the whole
-- table is about her.
SELECT
t.title,
cn.name
FROM cast_info as ci
JOIN name as n ON ci.person_id = n.id
JOIN title as t ON ci.movie_id = t.id
LEFT JOIN char_name as cn ON ci.person_role_id = cn.id
WHERE ci.person_id = 2977372
AND t.kind_id = 1 -- only return movies
LIMIT 0,100;
title | name |
---|---|
Antwone Fisher | Eva May |
Beautiful Creatures | Amma |
Beyond All Boundaries | Hortense Johnson - Arsenal Worker |
Beyond Babyland | Narrator |
Black Theater Today: 2005 | NA |
Blackhat | Carol Barrett |
Custody | Martha Schulman |
Dark Girls | Herself |
Disturbia | Detective Parker |
Doubt | Mrs. Miller |
title
, production_year
, and the name of the character that she played.Sample solution:
DM me your query from exercise 5 (due Thursday 11/21 by class time).