Basic query components in SQL (cont’d)

Goal

Learn SQL equivalents of data-wrangling tools beyond the basic “five verbs”; particularly “joins”, and get some practice combining the fundamental ingredients into productive queries – and getting used to some of the “quirks” of SQL’s requirements about query structure – to access data from large databases stored on a remote server in a resource-efficient way.

Setting up the connection

We need to load packages and set up a connection to the server again. We’ll again do some work with the "airlines" database.

Code:

For convenience here is the list of basic verbs again (make sure you’ve copied the .png files from the originals folder to your project folder):

Image Source: Baumer et al. Modern Data Science with R.

Image Source: Baumer et al. Modern Data Science with R.

Remember: Verbs lower in the list must appear after verbs higher in the list when constructing queries.

Here’s the dplyr to SQL translation summary again:

Image Source: Baumer et al. Modern Data Science with R

Image Source: Baumer et al. Modern Data Science with R

And, it bears repeating, in all caps this time:

IMPORTANT: ALWAYS USE LIMIT IN YOUR QUERIES, LEST YOU TRY TO FETCH HUNDREDS OF MILLIONS OF RECORDS AND BREAK EVERYTHING FOR EVERYONE!

One last reminder: to designate a code chunk “SQL”, use {sql connection=db} in the chunk options (where db is whatever you named your connection in a previous R code chunk) in place of the r that is usually there.

Exercises

The following exercises use tables from the imdb database instead of the airlines database, so you’ll need to open a new connection.

Relevant tables are:

  • name: records are people (actors, etc.)
  • title: records are works (movies, etc.)
  • cast_info: records are roles in works, indexed by person
  • char_name: records are roles in works, indexed by character

Open the connection first:

Check what tables are available

Displaying records 1 - 10
Tables_in_imdb
aka_name
aka_title
cast_info
char_name
comp_cast_type
company_name
company_type
complete_cast
info_type
keyword
  1. Find a movie of your choice in the title table. You can use WHERE <field> LIKE '%<sub string>%' to get entries that match part of a string. Tweak your query, possibly by refining it using other variable, for example, the year the movie came out, so that your query returns exactly one result.

SOLUTION

Let’s first look at the title table and see what the fields are

Displaying records 1 - 10
Field Type Null Key Default Extra
id int(11) NO PRI NA auto_increment
title text NO MUL NA
imdb_index varchar(12) YES NA
kind_id int(11) NO MUL NA
production_year int(11) YES NA
imdb_id int(11) YES MUL NA
phonetic_code varchar(5) YES MUL NA
episode_of_id int(11) YES MUL NA
season_nr int(11) YES MUL NA
episode_nr int(11) YES MUL NA

Now let’s get the movie Serenity (based on the TV show Firefly). It might take some playing around to return only the result we want.

1 records
id title imdb_index kind_id production_year imdb_id phonetic_code episode_of_id season_nr episode_nr series_years md5sum
4202518 Serenity NA 1 2005 NA S653 NA NA NA NA cb62d1f36eed91904ee57da27f8028ef

Our goal over the next few exercises will be to obtain a query that returns the full filmography of the incomparable Viola Davis arranged in chronological order, showing the production year, title of the movie or show, and the name of the character she played.

If you want a challenge, try to do that now: it will involve some exploration of the data to find the relevant field names, formats, etc., and a few join operations. Note: When using SQL, you should only ever join by fields that are listed as a KEY in the database (you can see which fields these are by using DESCRIBE <table_name> (don’t actually type the angle brackets; this is just a convention that indicates that it’s a template slot to be filled in with an actual table name). Generally KEY fields are id codes, which are much more efficient to operate over than text strings, and databases are set up to make filtering or comparing KEY fields fast.

If you want some more guidance, the next few exercises will break down the task into steps.

  1. Find out what person_id in the cast_info table corresponds to Viola Davis by consulting the name table. Note that the field may not have the same name in both tables!

SOLUTION

Again, let’s start by DESCRIBEing the name table to see what fields there are, and then list the first few entries to see how names are formatted.

9 records
Field Type Null Key Default Extra
id int(11) NO PRI NA auto_increment
name text NO MUL NA
imdb_index varchar(12) YES NA
imdb_id int(11) YES MUL NA
gender varchar(1) YES NA
name_pcode_cf varchar(5) YES MUL NA
name_pcode_nf varchar(5) YES MUL NA
surname_pcode varchar(5) YES MUL NA
md5sum varchar(32) YES MUL NA

It looks like the variable is called id in this table not person_id.

To figure out how names are represented, let’s look at the first few entries.

Displaying records 1 - 10
id name imdb_index imdb_id gender name_pcode_cf name_pcode_nf surname_pcode md5sum
235 -Alverio, Esteban Rodriguez NA NA m A4162 E2315 A416 f5c410bff6839b545d04c531f776e8f2
921 Aaberge, Theodor Olai NA NA m A1623 T3641 A162 eef277cb705ce78a3b41ed22b5d56292
1698 Aarudra NA NA m A636 NA NA 59b3b4b95e3223cb1471724d42c71654
1189 Aaltonen, Miro NA NA m A4356 M6435 A435 78cb32c803d9d05bc49d41f8abdeab49
2755 Abbas, Mubasshir NA NA m A1251 M1261 A12 3b94000fbde01c1acbdc72cc691ea579
2821 Abbasi, Kurosch NA NA m A1262 K6212 A12 1a375c48e4fa958d5f550caeeb18774c
779 A., Narayana Rao NA NA m A5656 N656 A c7a6b13ba56547bd0a1c352171b2d0c2
1482 Aaron, James II NA m A6525 J5265 A65 f9bc627c4dbe9e4b87e05e63e8c58deb
3094 Abbot Jr., Greg NA NA m A1326 G6213 NA f3c72bdb04a43aae7af1424c261d50ff
2942 Abbatiello, Don NA NA m A1343 D5134 A134 09e27d5c55428dab4714adafe8019a7d

Names are in Last, First format.

Now we’re ready to look up Viola Davis.

3 records
id name imdb_index imdb_id gender name_pcode_cf name_pcode_nf surname_pcode md5sum
2977373 Davis, Viola II NA f D1214 V4312 D12 cece125825cb648dcadaa25ed1f08cf2
2977372 Davis, Viola I NA f D1214 V4312 D12 4c0de69942ada8a98c00b22ec7c22ef4
4273298 Davis, Viola III NA NA D1214 V4312 D12 15ddfd05d8c7d49215107ece595dc2a3

Looks like there are three Viola Davises in IMDB. Looking at the IMDB website it looks like we want Viola Davis (I), so we can refine our query by filtering based on imdb_index.

1 records
id name imdb_index
2977372 Davis, Viola I

So the ID we want is 2977372.


  1. Get a table containing information about all of Viola Davis’s roles, using the cast_info table, showing her name in the output (requires a join). Don’t worry yet about showing the actual movie names or character names; for now we’ll just get a table with the relevant id codes. We can get the actual names with a join later.

SOLUTION

The usual glance at the table to check out variable names and formats:

7 records
Field Type Null Key Default Extra
id int(11) NO PRI NA auto_increment
person_id int(11) NO MUL NA
movie_id int(11) NO MUL NA
person_role_id int(11) YES MUL NA
note text YES NA
nr_order int(11) YES NA
role_id int(11) NO MUL NA
Displaying records 1 - 10
id person_id movie_id person_role_id note nr_order role_id
1 1 3432997 1 NA 31 1
2 2 1901690 2 NA NA 1
3 3 4027567 2 NA 25 1
4 3 4282876 3 NA 22 1
5 4 3542672 NA NA 12 1
6 5 3331520 4 (as $hutter Boy) 10 1
7 5 4027191 2 (as $hutter Boy) 1 1
8 5 4195731 5 (uncredited) NA 1
9 5 4263956 6 (uncredited) NA 1
10 5 4267787 7 (uncredited) NA 1

We have several ID columns here: id, person_id, movie_id, person_role_id and role_id. It’s a little hard to know what each of these is doing, but browsing through the table, it seems like id is just indexing individual entries in the table (so, every instance of a person playing a role in a single IMDB entry, even if they’ve played the role multiple times). Then, person_id is presumably specifying an individual person and movie_id is specifying the movie (or, presumably, other kind of title), person_role_id specifies a pairing of a person and a role, which could reoccur across, say, sequels, film adaptations of a TV show (or vice versa), etc. The first 100 entries all have role_id = 1… It’s not completely clear what this is doing, but we can probably do without it, since the information we need is in the other fields.

To get every role played by Viola Davis, we want entries with person_id = 2977372. To display her name next to each one, we’ll need to join with the name table.

Displaying records 1 - 10
movie_id role_id person_role_id name
3206068 2 418310 Davis, Viola
3206331 2 637 Davis, Viola
3206767 2 637 Davis, Viola
3208095 2 29580 Davis, Viola
3210140 2 29580 Davis, Viola
3210409 2 637 Davis, Viola
3210966 2 3265806 Davis, Viola
3211060 2 637 Davis, Viola
3213773 2 29580 Davis, Viola
3214760 2 637 Davis, Viola

This gives us movie ids and role ids for all of Viola Davis’s roles, but we can’t see the movie names or the role names yet. Further joinage is in order to get this info (that comes next).


  1. Add the character names and movie names to the previous output, by joining with char_name and title. Note: When using SQL, you should only ever join by fields that are listed as a KEY in the database: generally these are id codes, which are much more efficient to operate over than text strings, and databases are set up to make filtering or comparing KEY fields fast.

SOLUTION

  1. Find Viola Davis’s full filmography, in chronological order. Include each movie’s title, production_year, and the name of the character that she played. There should be 49 entries in total if you include only movies.

SOLUTION