Basic query components in SQL (cont’d)

Goal

Learn SQL equivalents of data-wrangling tools beyond the basic “five verbs”; particularly “joins”, and get some practice combining the fundamental ingredients into productive queries – and getting used to some of the “quirks” of SQL’s requirements about query structure – to access data from large databases stored on a remote server in a resource-efficient way.

Setting up the connection

We need to load packages and set up a connection to the server again. We’ll again do some work with the "airlines" database.

Code:

library(dbplyr)  ## Note the 'b'; this is not dplyr
library(mdsr)    ## Package for our book
library(RMySQL)  ## Standard R/SQL interface package
db <- dbConnect_scidb("airlines")

For convenience here is the list of basic verbs again (make sure you’ve copied the .png files from the originals folder to your project folder):

Image Source: Baumer et al. Modern Data Science with R.

Remember: Verbs lower in the list must appear after verbs higher in the list when constructing queries.

Here’s the dplyr to SQL translation summary again:

Image Source: Baumer et al. Modern Data Science with R

And, it bears repeating, in all caps this time:

IMPORTANT: ALWAYS USE LIMIT IN YOUR QUERIES, LEST YOU TRY TO FETCH HUNDREDS OF MILLIONS OF RECORDS AND BREAK EVERYTHING FOR EVERYONE!

One last reminder: to designate a code chunk “SQL”, use {sql connection=db} in the chunk options (where db is whatever you named your connection in a previous R code chunk) in place of the r that is usually there.

Exercises

The following exercises use tables from the imdb database instead of the airlines database, so you’ll need to open a new connection.

Relevant tables are:

name: records are people (actors, etc.)
title: records are works (movies, etc.)
cast_info: records are roles in works, indexed by person
char_name: records are roles in works, indexed by character

Open the connection first:

db <- dbConnect_scidb("imdb")

Check what tables are available

SHOW TABLES;

Displaying records 1 - 10
Tables_in_imdb
aka_name
aka_title
cast_info
char_name
comp_cast_type
company_name
company_type
complete_cast
info_type
keyword

Find a movie of your choice in the title table. You can use WHERE <field> LIKE '%<sub string>%' to get entries that match part of a string. Tweak your query, possibly by refining it using other variable, for example, the year the movie came out, so that your query returns exactly one result.

SOLUTION

Our goal over the next few exercises will be to obtain a query that returns the full filmography of the incomparable Viola Davis arranged in chronological order, showing the production year, title of the movie or show, and the name of the character she played.

If you want a challenge, try to do that now: it will involve some exploration of the data to find the relevant field names, formats, etc., and a few join operations. Note: When using SQL, you should only ever join by fields that are listed as a KEY in the database (you can see which fields these are by using DESCRIBE <table_name> (don’t actually type the angle brackets; this is just a convention that indicates that it’s a template slot to be filled in with an actual table name). Generally KEY fields are id codes, which are much more efficient to operate over than text strings, and databases are set up to make filtering or comparing KEY fields fast.

If you want some more guidance, the next few exercises will break down the task into steps.

Find out what person_id in the cast_info table corresponds to Viola Davis by consulting the name table. Note that the field may not have the same name in both tables!

SOLUTION

Get a table containing information about all of Viola Davis’s roles, using the cast_info table, showing her name in the output (requires a join). Don’t worry yet about showing the actual movie names or character names; for now we’ll just get a table with the relevant id codes. We can get the actual names with a join later.

SOLUTION

Add the character names and movie names to the previous output, by joining with char_name and title. Note: When using SQL, you should only ever join by fields that are listed as a KEY in the database: generally these are id codes, which are much more efficient to operate over than text strings, and databases are set up to make filtering or comparing KEY fields fast.

SOLUTION

Find Viola Davis’s full filmography, in chronological order. Include each movie’s title, production_year, and the name of the character that she played. There should be 49 entries in total if you include only movies.

STAT 209: SQL Part II

Basic query components in SQL (cont’d)

Goal

Setting up the connection

Exercises

SOLUTION

SOLUTION

SOLUTION

SOLUTION

SOLUTION