Deadlines (note: these are revised from what was originally posted)


We are continuing to build on the components of previous projects. The finished product is much the same as in the first two projects: you will write a blog-post-style writeup investigating a question of interest using data, with the aid of a small number of well-constructed visualizations.

As with the second project, you will need to do some wrangling of the data to get it ready for visualization.

The new component in this project is that you will use datasets that are too big to read into R all at once, so you will need to construct efficient SQL queries that read in only the information you need for your visualizations, followed (possibly) by additional data wrangling in dplyr.
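As a rough illustration of reading in only what you need, here is a minimal sketch that filters and aggregates inside the database and brings back only the summary rows. It assumes the DBI and RSQLite packages and uses an in-memory SQLite database as a stand-in for the course server; the `flights` table and its columns are hypothetical toy data, not the actual schema of any course database.

```r
library(DBI)

# Stand-in for the course server (assumption): an in-memory SQLite database.
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical toy table; imagine the real one is far too big to read whole.
dbWriteTable(con, "flights", data.frame(
  carrier   = c("AA", "AA", "DL", "DL", "UA"),
  year      = c(2015, 2016, 2016, 2016, 2016),
  arr_delay = c(10, 30, 5, 15, 60)
))

# Name only the columns you need, filter rows with WHERE, and aggregate
# in the database, so that only one row per carrier reaches R.
delays <- dbGetQuery(con, "
  SELECT carrier,
         AVG(arr_delay) AS mean_delay,
         COUNT(*)       AS n_flights
  FROM flights
  WHERE year = 2016
  GROUP BY carrier
  ORDER BY carrier
")

dbDisconnect(con)
```

The resulting `delays` data frame is small enough to touch up in dplyr and pass to ggplot2.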

Required skills

Relevant background and resources


The final writeup

The final result of the project will be much the same as in previous projects: Your group will work together to write a blog post that contains one or more data graphics that tell the reader something interesting about the domain that the data comes from. For a rough word count, aim for something in the ballpark of 800-1000 words, not counting the methods section; a little bit more meaty than last time. The suggested structure of the writeup is mostly the same as in previous projects:

  • An introductory section that sets up the context, “hooks” the reader – make them care about your topic! – and introduces the datasets and their sources. Tell the reader what the units of observation (“cases”) are, and what the relevant variables are. Don’t just point the reader to the data via a link: describe it in the text!

  • For each graphic you include, a paragraph or two discussing what the graphic shows, including a concise “take-home” message in one or two sentences.

  • A discussion section at the end of the main body, summarizing what we learn from your analysis (in non-technical terms! Speak to a reader who doesn’t know all this technical stuff), and reminding the reader why they should care.

  • A “methodology” section (separate from the body of the main blog post, but included in the same .Rmd as an “Appendix” section) that explains the process you had to go through to get your data in a form such that you could visualize what you wanted to visualize. Don’t just rehash what the code does; explain the goals you were trying to achieve with your wrangling and visualization steps, in a way that can be a bit more technical than the body of your writeup, but that doesn’t require the reader to know specific SQL, dplyr, or ggplot commands. In addition to describing your choices in the methodology section of the writeup, you should include comments in your code explaining what each step does and why.

Data-Wrangling Requirements

  • (At least some of) your visualizations should pull together data from multiple data tables via a join operation.
  • (At least) initial wrangling must be done using SQL, as efficiently as possible. You may want to touch up your data tables in dplyr, but make sure you are not importing more than you need into memory.
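The two requirements above can be sketched together: a join performed in SQL on a unique key, with the aggregation also done in the database. This is only an illustration, assuming the DBI and RSQLite packages and an in-memory SQLite database in place of the course server; the `flights` and `carriers` tables are hypothetical toy data.

```r
library(DBI)

# Stand-in for the course server (assumption): an in-memory SQLite database.
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical lookup table: one row per carrier, so `carrier` is a unique key.
dbWriteTable(con, "carriers", data.frame(
  carrier = c("AA", "DL"),
  name    = c("American Airlines", "Delta Air Lines")
))

# Hypothetical fact table: many rows per carrier.
dbWriteTable(con, "flights", data.frame(
  carrier   = c("AA", "AA", "DL"),
  dep_delay = c(5, 15, 20)
))

# Join on the key and summarize in SQL; only the summary is imported into R.
by_carrier <- dbGetQuery(con, "
  SELECT c.name,
         COUNT(*)         AS n_flights,
         AVG(f.dep_delay) AS mean_dep_delay
  FROM flights AS f
  JOIN carriers AS c ON f.carrier = c.carrier
  GROUP BY c.name
  ORDER BY c.name
")

dbDisconnect(con)
```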

Data Visualization Requirements

Prioritize quality over quantity. A single really well-thought-out graph that took a fair amount of work is much better than several hasty ones. You will need to make a judgment call about when it is better to overlay information in one graph and when it is better to put the information in separate graphs. Naturally, the graphs must be created within your .Rmd using ggplot2, and you should endeavor to do whatever wrangling is necessary beforehand so that the visualization pipeline is as well structured and concise as possible.

Even if you are doing some of your wrangling in dplyr, keep your wrangling pipeline(s) separate from your visualization pipeline(s), for readability’s sake.
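One way to keep the two pipelines separate is to save the wrangled table under its own name and start the plot from that object. The sketch below assumes the dplyr and ggplot2 packages; the `trips` data frame and its values are made up for illustration and stand in for the (already small) result of a SQL query.

```r
library(dplyr)
library(ggplot2)

# Hypothetical query result, already small enough to hold in memory.
trips <- data.frame(
  borough = c("Bronx", "Brooklyn", "Manhattan", "Queens"),
  n_trips = c(120, 450, 900, 300)
)

# Wrangling pipeline: ends in a named, plot-ready table.
trips_share <- trips %>%
  mutate(share = n_trips / sum(n_trips)) %>%
  arrange(desc(share))

# Visualization pipeline: starts from the plot-ready table and only plots.
trips_plot <- ggplot(trips_share, aes(x = reorder(borough, share), y = share)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Share of trips")
```

Because `trips_share` already holds exactly the columns being plotted, the ggplot2 code stays short and contains no data manipulation beyond ordering the bars.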


The scidb server has the following databases. You will probably want to investigate something using one of these, although if you find other data sources that you can access via SQL, be my guest.

  • airlines: on-time flight data from the Bureau of Transportation Statistics
  • citibike: trip-level data from New York City’s municipal bike rental system
  • fec: campaign finance data from the Federal Election Commission
  • imdb: a copy of the Internet Movie Database
  • lahman: historical season-level baseball statistics
  • nyctaxi: ride-level data from New York City’s Taxi & Limousine Commission

The GitHub workflow

  • The GitHub Classroom link to create your project repo is here
  • See the Project 2 description for an outline of the recommended GitHub workflow.

Turning in your project

Collaboration on your project should again take place via GitHub commits. Your .Rmd source should be part of a GitHub repo from its inception, and changes should be recorded via commits from the account of the person who made the edit. Even if you are sitting in the same room together, make sure that commits are made from the person making the edit. Everyone in the group must make at least one commit.

Your final submission will consist of the .Rmd source, the compiled .html, and any other files needed for the .Rmd to compile successfully. For example, if you are reading in the data from .csv files stored in your RStudio server account, commit those files. If you are reading data directly from a URL, this is not necessary.

Whatever state those files are in at the deadline is what I will grade.

Grading Rubric

The project is worth 25 points: 20 assigned to the group as a whole and 5 assigned to individuals. The grade is based on the following criteria:

Code grade (7 pts)

  • +1: the .Rmd compiles successfully with no error messages
  • +1: code, unnecessary messages, and raw R output (other than the plots) are suppressed from the .html output
  • +2: data from multiple data tables is joined in SQL
  • +2: SQL queries are written efficiently (for example, you don’t fetch rows or columns that aren’t relevant to your analysis, and joins make use of unique ID keys)
  • +1: The visualization pipeline is kept as “clean” as possible by wrangling your data into a form that is conducive to simple visualization commands
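For the output-suppression item above, one common approach is to set chunk options once in a setup chunk near the top of the .Rmd. This is a sketch, assuming the knitr package (which R Markdown uses to compile the document); it is not the only way to meet the requirement.

```r
# Place inside a setup chunk at the top of the .Rmd.
knitr::opts_chunk$set(
  echo    = FALSE,  # hide the source code
  message = FALSE,  # hide package startup and other messages
  warning = FALSE   # hide warnings
)
```

Plots still appear with these settings. Printed output from a chunk would also still appear, so either avoid printing objects or set `results = "hide"` on the chunks that do.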

Writeup grade (8 pts)

  • +1: The final post is the right length (800-1000ish words)
  • +1: At least one visualization is included, as generated by embedded code
  • +1: Each graphic is interpreted clearly and concisely in the text, including a “take-home” message in no more than two sentences
  • +1: Data-wrangling methodology is included in the appendix, and is clear
  • +1: Data-visualization methodology is included in the appendix, and is clear
  • +0-3: Subjective assessment of insightfulness and quality of the final product. This includes the quality of the content and presentation of the ideas in the introduction and discussion, and how effective your writeup is at making a hypothetical reader (who is not trained in all this technical stuff) care about your analysis.

Process grade (5 pts)

  • +1: All group members meet in person as soon as possible, to outline the project, identify datasets, list intended visualizations, and start doing some data-wrangling, before the workshop day. This meeting should be briefly described on the group Slack channel, noting any member absences.
  • +1: The GitHub workflow is followed, and informative commit messages (that describe what changes were made in that commit) are included.
  • +1: A Slack discussion is held as soon after the workshop as possible, to plan how remaining work will be done before the presentation day.
  • +1: The project is at the stage of a “complete draft” by the date of the presentation.
  • +1: All group members meet at least one more time in person as soon after the presentations as possible to discuss what revisions are needed, and create a work plan for the final writeup.

Individual grade (5 pts)

  • +1: your individual contributions are clearly documented with commits in GitHub
  • +0-2: you show up at the initial planning meeting, the workshop day, and the final planning meeting, and contribute to Slack discussions and to the presentation
  • +0-2: you do your share of contributing to the final result