Learn SQL equivalents of data-wrangling tools beyond the basic “five verbs”; particularly “joins”, and get some practice combining the fundamental ingredients into productive queries – and getting used to some of the “quirks” of SQL’s requirements about query structure – to access data from large databases stored on a remote server in a resource-efficient way.
We need to load packages and set up a connection to the server again. We’ll again do some work with the "airlines"
database.
Code:
library(dbplyr) ## Note the 'b'; this is not dplyr
library(mdsr) ## Package for our book
library(RMySQL) ## Standard R/SQL interface package
db <- dbConnect_scidb("airlines")
For convenience here is the list of basic verbs again (make sure you’ve copied the .png
files from the originals
folder to your project folder):
Remember: Verbs lower in the list must appear after verbs higher in the list when constructing queries.
Here’s the dplyr
to SQL
translation summary again:
And, it bears repeating, in all caps this time:
IMPORTANT: ALWAYS USE LIMIT
IN YOUR QUERIES, LEST YOU TRY TO FETCH HUNDREDS OF MILLIONS OF RECORDS AND BREAK EVERYTHING FOR EVERYONE!
One last reminder: to designate a code chunk “SQL”, use {sql connection=db}
in the chunk options (where db
is whatever you named your connection in a previous R code chunk) in place of the r
that is usually there.
The following exercises use tables from the imdb
database instead of the airlines
database, so you’ll need to open a new connection.
Relevant tables are:
name
: records are people (actors, etc.)title
: records are works (movies, etc.)cast_info
: records are roles in works, indexed by personchar_name
: records are roles in works, indexed by characterOpen the connection first:
Check what tables are available
Tables_in_imdb |
---|
aka_name |
aka_title |
cast_info |
char_name |
comp_cast_type |
company_name |
company_type |
complete_cast |
info_type |
keyword |
title
table. You can use WHERE <field> LIKE '%<sub string>%'
to get entries that match part of a string. Tweak your query, possibly by refining it using other variable, for example, the year the movie came out, so that your query returns exactly one result.Our goal over the next few exercises will be to obtain a query that returns the full filmography of the incomparable Viola Davis arranged in chronological order, showing the production year, title of the movie or show, and the name of the character she played.
If you want a challenge, try to do that now: it will involve some exploration of the data to find the relevant field names, formats, etc., and a few join operations. Note: When using SQL, you should only ever join by fields that are listed as a KEY in the database (you can see which fields these are by using DESCRIBE <table_name>
(don’t actually type the angle brackets; this is just a convention that indicates that it’s a template slot to be filled in with an actual table name). Generally KEY fields are id codes, which are much more efficient to operate over than text strings, and databases are set up to make filtering or comparing KEY fields fast.
If you want some more guidance, the next few exercises will break down the task into steps.
person_id
in the cast_info
table corresponds to Viola Davis by consulting the name
table. Note that the field may not have the same name in both tables!cast_info
table, showing her name in the output (requires a join). Don’t worry yet about showing the actual movie names or character names; for now we’ll just get a table with the relevant id codes. We can get the actual names with a join later.char_name
and title
. Note: When using SQL, you should only ever join by fields that are listed as a KEY in the database: generally these are id codes, which are much more efficient to operate over than text strings, and databases are set up to make filtering or comparing KEY fields fast.title
, production_year
, and the name of the character that she played. There should be 49 entries in total if you include only movies.