Basic query components in SQL (cont’d)

Goal

Learn SQL equivalents of data-wrangling tools beyond the basic “five verbs”; particularly “joins”, and get some practice combining the fundamental ingredients into productive queries – and getting used to some of the “quirks” of SQL’s requirements about query structure – to access data from large databases stored on a remote server in a resource-efficient way.

Setting up the connection

We need to load packages and set up a connection to the server again. We’ll again do some work with the "airlines" database.

Code:

library(dbplyr)  ## Note the 'b'; this is not dplyr
library(mdsr)    ## Package for our book
library(RMySQL)  ## Standard R/SQL interface package
db <- dbConnect_scidb("airlines")

For convenience here is the list of basic verbs again (make sure you’ve copied the .png files from the originals folder to your project folder):

Image Source: Baumer et al. Modern Data Science with R.

Remember: Verbs lower in the list must appear after verbs higher in the list when constructing queries.

Here’s the dplyr to SQL translation summary again:

Image Source: Baumer et al. Modern Data Science with R

And, it bears repeating, in all caps this time:

IMPORTANT: ALWAYS USE LIMIT IN YOUR QUERIES, LEST YOU TRY TO FETCH HUNDREDS OF MILLIONS OF RECORDS AND BREAK EVERYTHING FOR EVERYONE!

One last reminder: to designate a code chunk “SQL”, use {sql connection=db} in the chunk options (where db is whatever you named your connection in a previous R code chunk) in place of the r that is usually there.

Exercises

The following exercises use tables from the imdb database instead of the airlines database, so you’ll need to open a new connection.

Relevant tables are:

name: records are people (actors, etc.)
title: records are works (movies, etc.)
cast_info: records are roles in works, indexed by person
char_name: records are roles in works, indexed by character

Open the connection first:

db <- dbConnect_scidb("imdb")

Check what tables are available

SHOW TABLES;

Displaying records 1 - 10
Tables_in_imdb
aka_name
aka_title
cast_info
char_name
comp_cast_type
company_name
company_type
complete_cast
info_type
keyword

Find a movie of your choice in the title table. You can use WHERE <field> LIKE '%<sub string>%' to get entries that match part of a string. Tweak your query, possibly by refining it using other variable, for example, the year the movie came out, so that your query returns exactly one result.

SOLUTION

Let’s first look at the title table and see what the fields are

DESCRIBE title;

Displaying records 1 - 10
Field	Type	Null	Key	Default	Extra
id	int(11)	NO	PRI	NA	auto_increment
title	text	NO	MUL	NA
imdb_index	varchar(12)	YES		NA
kind_id	int(11)	NO	MUL	NA
production_year	int(11)	YES		NA
imdb_id	int(11)	YES	MUL	NA
phonetic_code	varchar(5)	YES	MUL	NA
episode_of_id	int(11)	YES	MUL	NA
season_nr	int(11)	YES	MUL	NA
episode_nr	int(11)	YES	MUL	NA

Now let’s get the movie Serenity (based on the TV show Firefly). It might take some playing around to return only the result we want.

SELECT * FROM title
WHERE title LIKE "%Serenity%" 
    AND kind_id = 1
    AND production_year = 2005
LIMIT 0,20;

1 records
id	title	imdb_index	kind_id	production_year	imdb_id	phonetic_code	episode_of_id	season_nr	episode_nr	series_years	md5sum
4202518	Serenity	NA	1	2005	NA	S653	NA	NA	NA	NA	cb62d1f36eed91904ee57da27f8028ef

Our goal over the next few exercises will be to obtain a query that returns the full filmography of the incomparable Viola Davis arranged in chronological order, showing the production year, title of the movie or show, and the name of the character she played.

If you want a challenge, try to do that now: it will involve some exploration of the data to find the relevant field names, formats, etc., and a few join operations. Note: When using SQL, you should only ever join by fields that are listed as a KEY in the database (you can see which fields these are by using DESCRIBE <table_name> (don’t actually type the angle brackets; this is just a convention that indicates that it’s a template slot to be filled in with an actual table name). Generally KEY fields are id codes, which are much more efficient to operate over than text strings, and databases are set up to make filtering or comparing KEY fields fast.

If you want some more guidance, the next few exercises will break down the task into steps.

Find out what person_id in the cast_info table corresponds to Viola Davis by consulting the name table. Note that the field may not have the same name in both tables!

SOLUTION

Again, let’s start by DESCRIBEing the name table to see what fields there are, and then list the first few entries to see how names are formatted.

DESCRIBE name;

9 records
Field	Type	Null	Key	Default	Extra
id	int(11)	NO	PRI	NA	auto_increment
name	text	NO	MUL	NA
imdb_index	varchar(12)	YES		NA
imdb_id	int(11)	YES	MUL	NA
gender	varchar(1)	YES		NA
name_pcode_cf	varchar(5)	YES	MUL	NA
name_pcode_nf	varchar(5)	YES	MUL	NA
surname_pcode	varchar(5)	YES	MUL	NA
md5sum	varchar(32)	YES	MUL	NA

It looks like the variable is called id in this table not person_id.

To figure out how names are represented, let’s look at the first few entries.

SELECT * FROM name LIMIT 0,10;

Displaying records 1 - 10
id	name	imdb_index	imdb_id	gender	name_pcode_cf	name_pcode_nf	surname_pcode	md5sum
235	-Alverio, Esteban Rodriguez	NA	NA	m	A4162	E2315	A416	f5c410bff6839b545d04c531f776e8f2
921	Aaberge, Theodor Olai	NA	NA	m	A1623	T3641	A162	eef277cb705ce78a3b41ed22b5d56292
1698	Aarudra	NA	NA	m	A636	NA	NA	59b3b4b95e3223cb1471724d42c71654
1189	Aaltonen, Miro	NA	NA	m	A4356	M6435	A435	78cb32c803d9d05bc49d41f8abdeab49
2755	Abbas, Mubasshir	NA	NA	m	A1251	M1261	A12	3b94000fbde01c1acbdc72cc691ea579
2821	Abbasi, Kurosch	NA	NA	m	A1262	K6212	A12	1a375c48e4fa958d5f550caeeb18774c
779	A., Narayana Rao	NA	NA	m	A5656	N656	A	c7a6b13ba56547bd0a1c352171b2d0c2
1482	Aaron, James	II	NA	m	A6525	J5265	A65	f9bc627c4dbe9e4b87e05e63e8c58deb
3094	Abbot Jr., Greg	NA	NA	m	A1326	G6213	NA	f3c72bdb04a43aae7af1424c261d50ff
2942	Abbatiello, Don	NA	NA	m	A1343	D5134	A134	09e27d5c55428dab4714adafe8019a7d

Names are in Last, First format.

Now we’re ready to look up Viola Davis.

SELECT *
FROM name
WHERE name LIKE '%Davis, Viola%'
LIMIT 0,10;

3 records
id	name	imdb_index	imdb_id	gender	name_pcode_cf	name_pcode_nf	surname_pcode	md5sum
2977373	Davis, Viola	II	NA	f	D1214	V4312	D12	cece125825cb648dcadaa25ed1f08cf2
2977372	Davis, Viola	I	NA	f	D1214	V4312	D12	4c0de69942ada8a98c00b22ec7c22ef4
4273298	Davis, Viola	III	NA	NA	D1214	V4312	D12	15ddfd05d8c7d49215107ece595dc2a3

Looks like there are three Viola Davises in IMDB. Looking at the IMDB website it looks like we want Viola Davis (I), so we can refine our query by filtering based on imdb_index.

SELECT id,name,imdb_index
FROM name
WHERE name LIKE '%Davis, Viola%'
  AND imdb_index = "I"
LIMIT 0,10;

1 records
id	name	imdb_index
2977372	Davis, Viola	I

So the ID we want is 2977372.

Get a table containing information about all of Viola Davis’s roles, using the cast_info table, showing her name in the output (requires a join). Don’t worry yet about showing the actual movie names or character names; for now we’ll just get a table with the relevant id codes. We can get the actual names with a join later.

SOLUTION

The usual glance at the table to check out variable names and formats:

DESCRIBE cast_info;

7 records
Field	Type	Null	Key	Default	Extra
id	int(11)	NO	PRI	NA	auto_increment
person_id	int(11)	NO	MUL	NA
movie_id	int(11)	NO	MUL	NA
person_role_id	int(11)	YES	MUL	NA
note	text	YES		NA
nr_order	int(11)	YES		NA
role_id	int(11)	NO	MUL	NA

SELECT * FROM cast_info
LIMIT 0,100;

Displaying records 1 - 10
id	person_id	movie_id	person_role_id	note	nr_order	role_id
1	1	3432997	1	NA	31	1
2	2	1901690	2	NA	NA	1
3	3	4027567	2	NA	25	1
4	3	4282876	3	NA	22	1
5	4	3542672	NA	NA	12	1
6	5	3331520	4	(as $hutter Boy)	10	1
7	5	4027191	2	(as $hutter Boy)	1	1
8	5	4195731	5	(uncredited)	NA	1
9	5	4263956	6	(uncredited)	NA	1
10	5	4267787	7	(uncredited)	NA	1

We have several ID columns here: id, person_id, movie_id, person_role_id and role_id. It’s a little hard to know what each of these is doing, but browsing through the table, it seems like id is just indexing individual entries in the table (so, every instance of a person playing a role in a single IMDB entry, even if they’ve played the role multiple times). Then, person_id is presumably specifying an individual person and movie_id is specifying the movie (or, presumably, other kind of title), person_role_id specifies a pairing of a person and a role, which could reoccur across, say, sequels, film adaptations of a TV show (or vice versa), etc. The first 100 entries all have role_id = 1… It’s not completely clear what this is doing, but we can probably do without it, since the information we need is in the other fields.

To get every role played by Viola Davis, we want entries with person_id = 2977372. To display her name next to each one, we’ll need to join with the name table.

SELECT 
  ci.movie_id,
  ci.role_id,
  ci.person_role_id,
  n.name
FROM cast_info as ci
LEFT JOIN name as n ON ci.person_id = n.id
WHERE ci.person_id = 2977372
LIMIT 0,200;

Displaying records 1 - 10
movie_id	role_id	person_role_id	name
3206068	2	418310	Davis, Viola
3206331	2	637	Davis, Viola
3206767	2	637	Davis, Viola
3208095	2	29580	Davis, Viola
3210140	2	29580	Davis, Viola
3210409	2	637	Davis, Viola
3210966	2	3265806	Davis, Viola
3211060	2	637	Davis, Viola
3213773	2	29580	Davis, Viola
3214760	2	637	Davis, Viola

This gives us movie ids and role ids for all of Viola Davis’s roles, but we can’t see the movie names or the role names yet. Further joinage is in order to get this info (that comes next).

Add the character names and movie names to the previous output, by joining with char_name and title. Note: When using SQL, you should only ever join by fields that are listed as a KEY in the database: generally these are id codes, which are much more efficient to operate over than text strings, and databases are set up to make filtering or comparing KEY fields fast.

SOLUTION

Find Viola Davis’s full filmography, in chronological order. Include each movie’s title, production_year, and the name of the character that she played. There should be 49 entries in total if you include only movies.

STAT 209: SQL Part II

Basic query components in SQL (cont’d)

Goal

Setting up the connection

Exercises

SOLUTION

SOLUTION

SOLUTION

SOLUTION

SOLUTION