Basic query components in SQL

Goal

Learn the SQL equivalents of the basic “five verbs” from dplyr, and practice using them to pull data from large databases stored on a remote server.

Setting up the connection

Before we can interact with the data, we need to set up a connection to the server that hosts the database. This is similar to what you do when you set up your RStudio account to talk to the GitHub servers: you need to supply the address where the data is, and a set of credentials to log in to the remote server.

The database we’ll work with is hosted at Smith College where the first author of your textbook teaches; and the authors have provided a convenience function with a general use set of credentials to make connecting with that database quick and easy.

Code:

library(dbplyr)  ## Note the 'b'; this is not dplyr
library(mdsr)    ## Package for our book
library(RMySQL)  ## Standard R/SQL interface package
db <- dbConnect_scidb("airlines")

We can see what data tables are available with dbListTables().

Code:

dbListTables(db)
## [1] "airports" "carriers" "flights"  "planes"   "summary"  "weather"

Interacting with arbitrary databases

(Skip this section for now; come back to this later if you want to use SQL with data other than scidb)

For more general usage (that is, to interact with databases other than scidb at Smith), we can use the generic dbConnect() function. You can see how this is done by peeking at the source code or dbConnect_scidb():

Code

dbConnect_scidb
## function (dbname, ...) 
## {
##     dbConnect(RMySQL::MySQL(), dbname = dbname, host = "scidb.smith.edu", 
##         user = "mth292", password = "RememberPi")
## }
## <environment: namespace:mdsr>

So using this function with the argument “airlines” is equivalent to typing

dbConnect(RMySQL::MySQL(), 
    dbname = "airlines",
    host = "scidb.smith.edu",
    user = "mth292",
    password = "RememberPi")

This is fine in this case since mth292 is a read-only account that has been set up for anyone to use, and so privacy of credentials is not a big deal. However, for more general usage, it’s a good idea to store your credentials in a configuration file that you keep locally, instead of typing out your password in your source code.

The config file should be called .my.cnf (note the leading ., which is a convention for this sort of file; note that this makes it hidden if using a standard file browser), placed in your home directory, and be formatted as follows

[scidbAirlines]
dbname = "airlines"
host = "scidb.smith.edu"
user = "mth292"
password = "RememberPi"

where the part in square brackets can be any shorthand you want to use for this database. Then you can open the connection by typing

Code:

con <-  dbConnect(RMySQL::MySQL(), group = "scidbAirlines")

Constructing a tbl view of the dataset

Sometimes you can avoid having to write much SQL code by creating a “view” of the dataset that you can interact with as though it were an R-style data frame (technically an instance of the tbl class)

Here’s how to create a tbl view of the flights data table Code:

flights <- tbl(db, "flights")

(You could do the same with other tables from the list you printed out with dbListTables() above)

The basic SQL verbs

You can do a lot of data-wrangling by interacting with this tbl view, without ever writing a single line of SQL code. However, for the cases when that doesn’t work, let’s dive into writing basic SQL queries.

Here’s a summary list of the basic verbs and what they’re used for (reproduced from MDSR):

Image Source: Baumer et al. Modern Data Science with R.

Image Source: Baumer et al. Modern Data Science with R.

Note: SQL is less flexible than dplyr about what order the verbs show up in. The order in the above table is the canonical one, and verbs lower in the list must appear after verbs higher in the list.

Here’s a table summarizing how to translate between dplyr verbs and SQL verbs (also reproduced from MDSR):

Image Source: Baumer et al. Modern Data Science with R

Image Source: Baumer et al. Modern Data Science with R

Your first query: SELECT * FROM <table> LIMIT 0,<n>

To view the first few rows of the flights data without creating a tbl view first, we can use a SELECT * FROM <table> LIMIT 0,<n> construction (where n is the number of rows we want to view)

Caution: Be very careful never to run a command like the above without the LIMIT component unless you know for sure that the table you’re accessing is relatively small. Omitting this will cause your computer to try to retrieve and print the entirety of the database, which in this case is over 100 million records. This will likely crash your computer and also slow the server way down for everyone else.

In a Markdown document, you can create an executable raw SQL query by creating a code chunk that opens with {sql connection=db} (where db is whatever you named your connection in a previous R code chunk) in place of the r that is usually there.

Code:

SELECT * FROM flights LIMIT 0,10
Displaying records 1 - 10
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier tailnum flight origin dest air_time distance cancelled diverted hour minute time_hour
1987 10 1 3 1845 318 41 1920 321 CO NA 974 SPN GUM 0 129 0 0 18 45 1987-10-01 18:45:00
1987 10 1 3 5 -2 38 40 -2 DL NA 251 MCO PBI 0 142 0 0 0 5 1987-10-01 00:05:00
1987 10 1 3 2359 4 728 712 16 AL NA 602 LAX PIT 0 2136 0 0 23 59 1987-10-01 23:59:00
1987 10 1 4 9 -5 607 609 -2 AA NA 640 SMF ORD 0 1781 0 0 0 9 1987-10-01 00:09:00
1987 10 1 5 5 0 119 114 5 EA NA 57 LAX SFO 0 337 0 0 0 5 1987-10-01 00:05:00
1987 10 1 5 5 0 52 35 17 HP NA 351 ICT LAS 0 986 0 0 0 5 1987-10-01 00:05:00
1987 10 1 5 10 -5 41 50 -9 AA NA 583 PDX SEA 0 129 0 0 0 10 1987-10-01 00:10:00
1987 10 1 6 8 -2 534 517 17 UA NA 500 LAS ORD 0 1514 0 0 0 8 1987-10-01 00:08:00
1987 10 1 6 2340 26 34 15 19 WN NA 970 OKC TUL 0 111 0 0 23 40 1987-10-01 23:40:00
1987 10 1 7 15 -8 39 47 -8 UA NA 3 SMF OAK 0 75 0 0 0 15 1987-10-01 00:15:00

Note that in the above code, flights is referring to the table in the database, not the tbl variable we created above.

This is a lot of information even for just 10 cases. We can restrict the output to just the variables we care about by just listing their names separated by commas in place of the wildcard *. This is equivalent to the select() verb in dplyr. For example:

Code:

SELECT year,month,day,carrier,flight,origin,dest FROM flights
    LIMIT 0,10;
Displaying records 1 - 10
year month day carrier flight origin dest
1987 10 1 CO 974 SPN GUM
1987 10 1 DL 251 MCO PBI
1987 10 1 AL 602 LAX PIT
1987 10 1 AA 640 SMF ORD
1987 10 1 EA 57 LAX SFO
1987 10 1 HP 351 ICT LAS
1987 10 1 AA 583 PDX SEA
1987 10 1 UA 500 LAS ORD
1987 10 1 WN 970 OKC TUL
1987 10 1 UA 3 SMF OAK

Using WHERE to filter data

To restrict the output to certain cases, we use the WHERE verb (roughly equivalent to filter() in dplyr). As with filter() we can create conjunctions of filtering conditions; in SQL we just use the AND keyword. For example, to get only United flights on June 29, 2012, we can do

Code

SELECT year,month,day,carrier,flight,origin,dest 
FROM flights
WHERE year=2013 AND month=6 AND day=29 AND carrier='UA'
LIMIT 0,10; -- just to be safe (note the -- for comments!)
Displaying records 1 - 10
year month day carrier flight origin dest
2013 6 29 UA 1294 ONT IAH
2013 6 29 UA 368 PDX IAH
2013 6 29 UA 1481 SEA ORD
2013 6 29 UA 1202 LAX ORD
2013 6 29 UA 249 LAX IAH
2013 6 29 UA 1104 ANC DEN
2013 6 29 UA 369 SEA IAH
2013 6 29 UA 1238 SMF IAH
2013 6 29 UA 1197 SFO IAH
2013 6 29 UA 455 SFO LAS

Note the single = signs here, unlike in dplyr where we would use == in this context.

Filtering on variables not in the output

In dplyr if we want to use a variable as part of a filtering condition, it has to be part of the dataset at the time the filter occurs. For example, if I want to omit the year, month, day and carrier columns from the above dataset since I am only looking at data from one specific day and carrier, I would need to do the filter() before doing the select(); otherwise at the point when the filter() occurs, those variables are not present.

Code:

## Will give an error
flights %>%
  select(flight,origin,dest) %>%
  filter(year == 2012 & month == 6 & day == 29 & carrier == 'UA') %>%
  head(10)

## This works
flights %>%
  filter(year == 2012 & month == 6 & day == 29 & carrier == 'UA') %>%
  select(flight, origin, dest) %>%
  head(10)

In SQL, on the other hand, SELECT must always occur before WHERE in a query. However, we are allowed to refer to variables in a WHERE statement that are not in the output. In fact WHERE can only refer to variables in the original data, and cannot refer to variables calculated elsewhere in the query.

Code:

SELECT flight, origin, dest FROM flights
WHERE year = 2012 
  AND month = 6 -- can continue queries on the next line
  AND day = 29  -- indentation is just to facilitate readability
  AND carrier = 'UA'
LIMIT 0,10; -- finish the query with a semicolon
Displaying records 1 - 10
flight origin dest
262 ORD DEN
1741 SEA IAH
580 SFO BWI
1611 EWR IAH
710 ORD MSP
587 SFO SNA
1443 SFO IAH
1737 LAX EWR
522 SFO PDX
1173 IAD TPA

BETWEEN

To get flights from a particular date range, say June 25th through 30th, 2012, we can use BETWEEN with WHERE:

Code

SELECT carrier, flight FROM flights
WHERE year=2012
AND month=6
AND day BETWEEN 25 AND 30
LIMIT 0,10;
Displaying records 1 - 10
carrier flight
B6 580
EV 5730
UA 1482
B6 165
EV 4696
AA 1866
AA 700
FL 372
DL 1769
UA 1237

Creating new variables (SQL analog of mutate())

If, however, we wanted to specify a date range that spanned parts of two different months (say, June 15th through July 14th), this would be cumbersome to write using WHERE statements alone. We could say

Code:

SELECT month,day,carrier,flight FROM flights
WHERE year=2012
AND (month=6 AND day BETWEEN 15 AND 30) --note the parens for grouping
OR (month=7 AND day BETWEEN 1 and 14)
LIMIT 0,10;

but this is a bit awkward. Instead, we may want to create a new column that represents the date as a single number that we can reference.

There isn’t actually a verb in SQL that directly corresponds to mutate() in dplyr; it turns out we do this as part of the SELECT step, with the help of the keyword as which creates an “alias” for an expression.

The example below uses the str_to_date() function to translate year, month and date into a single value with which ordinal comparisons can be made.

Code:

-- This will produce an error
SELECT 
  str_to_date(concat(year, '-', month, '-', day), '%Y-%m-%d') as date,
  origin,dest,flight,carrier
FROM flights
WHERE carrier='UA' AND date BETWEEN '2012-06-16' AND '2012-07-14'
LIMIT 0,10

This produces an error! Why?

Filtering on calculated variables: HAVING

Remember we said above that WHERE only works with variables that exist in the original dataset? That means we can’t use date with WHERE, since date was calculated in our query.

Instead of WHERE, we need to use the verb HAVING, which works much the same way, but allows us to use calculated variables. The reason these are two different verbs is similar to why statically typed programming languages require you to specify what data type you will pass to an argument: if the SQL engine knows what type of variable you are passing in, it allows the query to be run more efficiently, which is increasingly important as datasets get larger.

Code:

SELECT 
  str_to_date(concat(year, '-', month, '-', day), '%Y-%m-%d') as date,
  origin,dest,flight,carrier
FROM flights
WHERE carrier='UA'
HAVING date BETWEEN '2012-06-16' AND '2012-07-14'
LIMIT 0,10
Displaying records 1 - 10
date origin dest flight carrier
2012-06-16 SFO ORD 236 UA
2012-06-16 SEA IAH 1741 UA
2012-06-16 LAX EWR 1000 UA
2012-06-16 SFO EWR 1175 UA
2012-06-16 ANC DEN 1104 UA
2012-06-16 ORD IAD 1251 UA
2012-06-16 SEA ORD 512 UA
2012-06-16 SFO IAH 1184 UA
2012-06-16 PDX IAH 1719 UA
2012-06-16 ORD CMH 1228 UA

Sorting with ORDER BY (cf. dplyr::arrange())

One last verb for today. To sort the output, we can use ORDER BY, which works like arrange() in dplyr. It has asc and desc options to control the sorting direction, and you can specify more than one clause to create nested sorts. For example, to sort descending by flight number within date, ascending:

Code:

SELECT 
  str_to_date(concat(year, '-', month, '-', day), '%Y-%m-%d') as date,
  origin,dest,flight,carrier
FROM flights
WHERE carrier='UA'
HAVING date BETWEEN '2012-06-16' AND '2012-07-14'
ORDER BY flight asc
LIMIT 0,10;