Basic query components in SQL

Goal

Learn the SQL equivalents of the basic “five verbs” from dplyr, and practice using them to pull data from large databases stored on a remote server.

Setting up the connection

Before we can interact with the data, we need to set up a connection to the server that hosts the database. This is similar to what you do when you set up your RStudio account to talk to the GitHub servers: you need to supply the address where the data is, and a set of credentials to log in to the remote server.

The database we’ll work with is hosted at Smith College where the first author of your textbook teaches; and the authors have provided a convenience function with a general-use set of credentials to make connecting with that database quick and easy.

Code:

We can see what data tables are available with dbListTables().

Code:

## [1] "airports" "carriers" "flights"  "planes"

Interacting with arbitrary databases

(You can skip this section for now; it’s here for reference if later you want to use SQL with data other than scidb)

For more general usage (that is, to interact with databases other than scidb at Smith), we can use the generic dbConnect() function. You can see how this is done by peeking at the source code or dbConnect_scidb():

Code

## function (dbname, ...) 
## {
##     DBI::dbConnect(RMySQL::MySQL(), dbname = dbname, host = "mdsr.cdc7tgkkqd0n.us-east-1.rds.amazonaws.com", 
##         user = "mdsr_public", password = "ImhsmflMDSwR", ...)
## }
## <bytecode: 0x5652deb86f00>
## <environment: namespace:mdsr>

So using this function with the argument “airlines” is equivalent to typing

This is fine in this case since mdsr_public is a read-only account that has been set up for anyone to use, and so privacy of credentials is not a big deal. However, for more general usage, it’s a good idea to store your credentials in a configuration file that you keep locally, instead of typing out your password in your source code.

The config file should be a plain text file called .my.cnf and placed in your home directory (note the leading ., which is a convention for this sort of file; note that this makes it hidden if using a standard file browser). It should be formatted as follows

[scidbAirlines]
dbname = "airlines"
host = "mdsr.cdc7tgkkqd0n.us-east-1.rds.amazonaws.com"
user = "mdsr_public"
password = "ImhsmflMDSwR"

where the part in square brackets can be any shorthand you want to use for this database. Then you can open the connection by typing

Code:

The resulting R object is equivalent to the object db we created above using the helper function that hardcoded the access credentials for us, and represents a connection to the remote database.

Constructing a tbl view of the dataset

Sometimes you can avoid having to write much SQL code by creating a “view” of the dataset that you can interact with as though it were an R data frame (technically an instance of the tbl class)

Here’s how to create a tbl view of the flights data table

Code:

(You could do the same with other tables from the list you printed out with dbListTables() above)

The basic SQL verbs

You can do a lot of data-wrangling by interacting with this tbl view, without ever writing a single line of SQL code. However, for the cases when that doesn’t work, let’s dive into writing basic SQL queries.

Here’s a summary list of the basic verbs and what they’re used for (reproduced from MDSR):

Image Source: Baumer et al. Modern Data Science with R.

Image Source: Baumer et al. Modern Data Science with R.

Note: SQL is less flexible than dplyr about what order the verbs show up in. The order in the above table is the canonical one, and verbs earlier in the list cannot appear after verbs later in the list. We won’t always use every verb, but the ones we do use have to be in order from the earliest one in the list to the latest. The only verbs that have to occur in every query are SELECT and FROM, which specify the fields (variables/columns) we want to return, and the table from which we want to get them, respectively. So the simplest query, which is equivalent to just printing a dataset in R, is

SELECT *
    FROM my_data

where the * is a “wildcard” that means we are selecting “everything”. That said, never do this — many SQL databases are very large, and if you write a query that is asking to print out all the data, there’s a chance your computer (or even the RStudio server) could run out of memory and crash. Especially if multiple people are using it at once!

Translating dplyr to SQL

Here’s a table summarizing how to translate between dplyr verbs and SQL verbs (also reproduced from MDSR):