Learn SQL equivalents of data-wrangling tools beyond the basic “five verbs”; particularly “joins”, and get some practice combining the fundamental ingredients into productive queries – and getting used to some of the “quirks” of SQL’s requirements about query structure – to access data from large databases stored on a remote server in a resource-efficient way.
We need to load packages and set up a connection to the server again. We’ll again do some work with the "airlines"
database.
Code:
library(dbplyr) ## Note the 'b'; this is not dplyr
library(mdsr) ## Package for our book
library(RMySQL) ## Standard R/SQL interface package
db <- dbConnect_scidb("airlines")
For convenience here is the list of basic verbs again (make sure you’ve copied the .png
files from the originals
folder to your project folder):
Remember: Verbs lower in the list must appear after verbs higher in the list when constructing queries.
Here’s the dplyr
to SQL
translation summary again:
And, it bears repeating, in all caps this time:
IMPORTANT: ALWAYS USE LIMIT
IN YOUR QUERIES, LEST YOU TRY TO FETCH HUNDREDS OF MILLIONS OF RECORDS AND BREAK EVERYTHING FOR EVERYONE!
One last reminder: to designate a code chunk “SQL”, use {sql connection=db}
in the chunk options (where db
is whatever you named your connection in a previous R code chunk) in place of the r
that is usually there.
The following exercises use tables from the imdb
database instead of the airlines
database, so you’ll need to open a new connection.
Relevant tables are:
name
: records are people (actors, etc.)title
: records are works (movies, etc.)cast_info
: records are roles in works, indexed by personchar_name
: records are roles in works, indexed by characterOpen the connection first:
Check what tables are available
Tables_in_imdb |
---|
aka_name |
aka_title |
cast_info |
char_name |
comp_cast_type |
company_name |
company_type |
complete_cast |
info_type |
keyword |
title
table. You can use WHERE <field> LIKE '%<sub string>%'
to get entries that match part of a string. Tweak your query, possibly by refining it using other variable, for example, the year the movie came out, so that your query returns exactly one result.Let’s first look at the title
table and see what the fields are
Field | Type | Null | Key | Default | Extra |
---|---|---|---|---|---|
id | int(11) | NO | PRI | NA | auto_increment |
title | text | NO | MUL | NA | |
imdb_index | varchar(12) | YES | NA | ||
kind_id | int(11) | NO | MUL | NA | |
production_year | int(11) | YES | NA | ||
imdb_id | int(11) | YES | MUL | NA | |
phonetic_code | varchar(5) | YES | MUL | NA | |
episode_of_id | int(11) | YES | MUL | NA | |
season_nr | int(11) | YES | MUL | NA | |
episode_nr | int(11) | YES | MUL | NA |
Now let’s get the movie Serenity (based on the TV show Firefly). It might take some playing around to return only the result we want.
SELECT * FROM title
WHERE title LIKE "%Serenity%"
AND kind_id = 1
AND production_year = 2005
LIMIT 0,20;
id | title | imdb_index | kind_id | production_year | imdb_id | phonetic_code | episode_of_id | season_nr | episode_nr | series_years | md5sum |
---|---|---|---|---|---|---|---|---|---|---|---|
4202518 | Serenity | NA | 1 | 2005 | NA | S653 | NA | NA | NA | NA | cb62d1f36eed91904ee57da27f8028ef |
Our goal over the next few exercises will be to obtain a query that returns the full filmography of the incomparable Viola Davis arranged in chronological order, showing the production year, title of the movie or show, and the name of the character she played.
If you want a challenge, try to do that now: it will involve some exploration of the data to find the relevant field names, formats, etc., and a few join operations. Note: When using SQL, you should only ever join by fields that are listed as a KEY in the database (you can see which fields these are by using DESCRIBE <table_name>
(don’t actually type the angle brackets; this is just a convention that indicates that it’s a template slot to be filled in with an actual table name). Generally KEY fields are id codes, which are much more efficient to operate over than text strings, and databases are set up to make filtering or comparing KEY fields fast.
If you want some more guidance, the next few exercises will break down the task into steps.
person_id
in the cast_info
table corresponds to Viola Davis by consulting the name
table. Note that the field may not have the same name in both tables!Again, let’s start by DESCRIBE
ing the name
table to see what fields there are, and then list the first few entries to see how names are formatted.
Field | Type | Null | Key | Default | Extra |
---|---|---|---|---|---|
id | int(11) | NO | PRI | NA | auto_increment |
name | text | NO | MUL | NA | |
imdb_index | varchar(12) | YES | NA | ||
imdb_id | int(11) | YES | MUL | NA | |
gender | varchar(1) | YES | NA | ||
name_pcode_cf | varchar(5) | YES | MUL | NA | |
name_pcode_nf | varchar(5) | YES | MUL | NA | |
surname_pcode | varchar(5) | YES | MUL | NA | |
md5sum | varchar(32) | YES | MUL | NA |
It looks like the variable is called id
in this table not person_id
.
To figure out how names are represented, let’s look at the first few entries.
id | name | imdb_index | imdb_id | gender | name_pcode_cf | name_pcode_nf | surname_pcode | md5sum |
---|---|---|---|---|---|---|---|---|
235 | -Alverio, Esteban Rodriguez | NA | NA | m | A4162 | E2315 | A416 | f5c410bff6839b545d04c531f776e8f2 |
921 | Aaberge, Theodor Olai | NA | NA | m | A1623 | T3641 | A162 | eef277cb705ce78a3b41ed22b5d56292 |
1698 | Aarudra | NA | NA | m | A636 | NA | NA | 59b3b4b95e3223cb1471724d42c71654 |
1189 | Aaltonen, Miro | NA | NA | m | A4356 | M6435 | A435 | 78cb32c803d9d05bc49d41f8abdeab49 |
2755 | Abbas, Mubasshir | NA | NA | m | A1251 | M1261 | A12 | 3b94000fbde01c1acbdc72cc691ea579 |
2821 | Abbasi, Kurosch | NA | NA | m | A1262 | K6212 | A12 | 1a375c48e4fa958d5f550caeeb18774c |
779 | A., Narayana Rao | NA | NA | m | A5656 | N656 | A | c7a6b13ba56547bd0a1c352171b2d0c2 |
1482 | Aaron, James | II | NA | m | A6525 | J5265 | A65 | f9bc627c4dbe9e4b87e05e63e8c58deb |
3094 | Abbot Jr., Greg | NA | NA | m | A1326 | G6213 | NA | f3c72bdb04a43aae7af1424c261d50ff |
2942 | Abbatiello, Don | NA | NA | m | A1343 | D5134 | A134 | 09e27d5c55428dab4714adafe8019a7d |
Names are in Last, First format.
Now we’re ready to look up Viola Davis.
id | name | imdb_index | imdb_id | gender | name_pcode_cf | name_pcode_nf | surname_pcode | md5sum |
---|---|---|---|---|---|---|---|---|
2977373 | Davis, Viola | II | NA | f | D1214 | V4312 | D12 | cece125825cb648dcadaa25ed1f08cf2 |
2977372 | Davis, Viola | I | NA | f | D1214 | V4312 | D12 | 4c0de69942ada8a98c00b22ec7c22ef4 |
4273298 | Davis, Viola | III | NA | NA | D1214 | V4312 | D12 | 15ddfd05d8c7d49215107ece595dc2a3 |
Looks like there are three Viola Davises in IMDB. Looking at the IMDB website it looks like we want Viola Davis (I), so we can refine our query by filtering based on imdb_index
.
SELECT id,name,imdb_index
FROM name
WHERE name LIKE '%Davis, Viola%'
AND imdb_index = "I"
LIMIT 0,10;
id | name | imdb_index |
---|---|---|
2977372 | Davis, Viola | I |
So the ID we want is 2977372
.
cast_info
table, showing her name in the output (requires a join). Don’t worry yet about showing the actual movie names or character names; for now we’ll just get a table with the relevant id codes. We can get the actual names with a join later.The usual glance at the table to check out variable names and formats:
Field | Type | Null | Key | Default | Extra |
---|---|---|---|---|---|
id | int(11) | NO | PRI | NA | auto_increment |
person_id | int(11) | NO | MUL | NA | |
movie_id | int(11) | NO | MUL | NA | |
person_role_id | int(11) | YES | MUL | NA | |
note | text | YES | NA | ||
nr_order | int(11) | YES | NA | ||
role_id | int(11) | NO | MUL | NA |
id | person_id | movie_id | person_role_id | note | nr_order | role_id |
---|---|---|---|---|---|---|
1 | 1 | 3432997 | 1 | NA | 31 | 1 |
2 | 2 | 1901690 | 2 | NA | NA | 1 |
3 | 3 | 4027567 | 2 | NA | 25 | 1 |
4 | 3 | 4282876 | 3 | NA | 22 | 1 |
5 | 4 | 3542672 | NA | NA | 12 | 1 |
6 | 5 | 3331520 | 4 | (as $hutter Boy) | 10 | 1 |
7 | 5 | 4027191 | 2 | (as $hutter Boy) | 1 | 1 |
8 | 5 | 4195731 | 5 | (uncredited) | NA | 1 |
9 | 5 | 4263956 | 6 | (uncredited) | NA | 1 |
10 | 5 | 4267787 | 7 | (uncredited) | NA | 1 |
We have several ID columns here: id
, person_id
, movie_id
, person_role_id
and role_id
. It’s a little hard to know what each of these is doing, but browsing through the table, it seems like id
is just indexing individual entries in the table (so, every instance of a person playing a role in a single IMDB entry, even if they’ve played the role multiple times). Then, person_id
is presumably specifying an individual person and movie_id
is specifying the movie (or, presumably, other kind of title), person_role_id
specifies a pairing of a person and a role, which could reoccur across, say, sequels, film adaptations of a TV show (or vice versa), etc. The first 100 entries all have role_id = 1
… It’s not completely clear what this is doing, but we can probably do without it, since the information we need is in the other fields.
To get every role played by Viola Davis, we want entries with person_id = 2977372
. To display her name next to each one, we’ll need to join with the name
table.
SELECT
ci.movie_id,
ci.role_id,
ci.person_role_id,
n.name
FROM cast_info as ci
LEFT JOIN name as n ON ci.person_id = n.id
WHERE ci.person_id = 2977372
LIMIT 0,200;
movie_id | role_id | person_role_id | name |
---|---|---|---|
3206068 | 2 | 418310 | Davis, Viola |
3206331 | 2 | 637 | Davis, Viola |
3206767 | 2 | 637 | Davis, Viola |
3208095 | 2 | 29580 | Davis, Viola |
3210140 | 2 | 29580 | Davis, Viola |
3210409 | 2 | 637 | Davis, Viola |
3210966 | 2 | 3265806 | Davis, Viola |
3211060 | 2 | 637 | Davis, Viola |
3213773 | 2 | 29580 | Davis, Viola |
3214760 | 2 | 637 | Davis, Viola |
This gives us movie ids and role ids for all of Viola Davis’s roles, but we can’t see the movie names or the role names yet. Further joinage is in order to get this info (that comes next).
char_name
and title
. Note: When using SQL, you should only ever join by fields that are listed as a KEY in the database: generally these are id codes, which are much more efficient to operate over than text strings, and databases are set up to make filtering or comparing KEY fields fast.title
, production_year
, and the name of the character that she played. There should be 49 entries in total if you include only movies.