Goals

The goal of this project is to use your new data-wrangling skills to create data visualizations that would not have been possible without that wrangling.

Required skills

Relevant background and resources

Instructions

The final writeup

The final result of the project will be much the same as in Project 1: Your group will work together to write a blog post that contains one or more data graphics that tell the reader something interesting about the domain that the data comes from. The same examples as before apply:

The 200-400 word range from last time was a bit tight (nearly everyone exceeded it, but that’s ok), so this time, let’s aim for 500ish words, not counting the methods section. The suggested structure of the writeup is the same as last time:

  • An introductory paragraph that sets up the context and introduces the datasets and their sources. Tell the reader what the units of observation (“cases”) are, and what the relevant variables are. Don’t just point the reader to the data via a link: describe it in the text!

  • For each graphic you include, a paragraph or two discussing what the graphic shows, including a concise “take-home” message in one or two sentences.

  • A “methodology” section (separate from the body of the main blog post, but included in the same .Rmd as an “Appendix” section) that explains the data wrangling you had to do to get your data into a form you could visualize. Also explain your visualization choices: why you chose the type of graph (geometry), the aesthetic mappings, the color scheme, etc. (Beyond describing your choices in the methodology section of the writeup, you should include comments in your code explaining what each step does and why.)

Data-Wrangling Requirements

  • (At least some of) your visualizations should pull together data from multiple data tables via a join operation.
  • In addition, your project should make non-trivial use of the five main “single table” verbs: filter(), select(), mutate(), arrange(), and summarize() (likely together with group_by()). The use of these verbs should be natural and motivated by the desired visualization, but I envision that you will make use of at least one summarize(), as well as at least two other verbs in combination.
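As a rough illustration of how these requirements can fit together, here is a minimal sketch that joins two tables and then uses filter(), summarize() (with group_by()), and arrange(). The tables `cities` and `readings`, their columns, and all values are hypothetical and invented purely for illustration; your own pipeline should be driven by the visualization you want.

```r
library(dplyr)

# Hypothetical toy tables, invented for illustration only.
cities <- tibble(
  city   = c("Springfield", "Shelbyville", "Ogdenville"),
  region = c("North", "North", "South")
)

readings <- tibble(
  city  = c("Springfield", "Springfield", "Shelbyville", "Ogdenville", "Ogdenville"),
  year  = c(2016, 2017, 2017, 2016, 2017),
  value = c(10, 14, 8, NA, 6)
)

region_summary <- readings %>%
  filter(!is.na(value)) %>%             # drop cases with no measurement
  left_join(cities, by = "city") %>%    # add each city's region from the second table
  group_by(region) %>%                  # one group per region
  summarize(
    n_readings = n(),                   # number of readings in the region
    mean_value = mean(value)            # average reading in the region
  ) %>%
  arrange(desc(mean_value))             # largest average first

region_summary
```

The join is what makes the regional summary possible: `region` lives in one table and `value` in the other, so neither table alone could produce it.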

Data Visualization Requirements

Your writeup might consist of just a single graph! I’d rather see one really well thought out graph that took a fair amount of work to create than several hasty ones. You can include any additional graphs that tell us something interesting about the data. You will need to make a judgment call about when it is better to overlay information in one graph and when it is better to put the information in separate graphs. Naturally the graphs must be created within your .Rmd using ggplot2, and you should endeavor to do whatever wrangling is necessary beforehand so that the visualization pipeline is as well structured and concise as possible.

I encourage you to keep your wrangling pipeline(s) separate from your visualization pipeline(s), for readability’s sake.
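One possible way to keep the two pipelines separate is sketched below: the wrangling pipeline ends by saving a tidy summary table, and the visualization pipeline starts from that table. The `sales` data and the object names are hypothetical, chosen only to make the example self-contained.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data, invented for illustration only.
sales <- tibble(
  store = c("A", "A", "B", "B", "C"),
  units = c(3, 5, 2, 7, 4)
)

# Wrangling pipeline: ends with a table that is ready to plot.
store_totals <- sales %>%
  group_by(store) %>%
  summarize(total_units = sum(units))

# Visualization pipeline: starts from the wrangled table and only describes the plot.
totals_plot <- ggplot(store_totals, aes(x = store, y = total_units)) +
  geom_col() +
  labs(x = "Store", y = "Total units sold")

totals_plot
```

Saving the intermediate table under its own name also makes it easy to inspect the wrangled data before plotting it.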

Data

You can use any data source you want, except that this time you may not use data directly from an R package. For full credit, you will need to join data from more than one table. The tables you join need not be from the same source; for example, you could join data about the same geographical entities, or time periods, etc., from different sources.

Some possible sources for data are:

The GitHub workflow

  • The first step is for every member of your group to “accept” the project assignment via this link on GitHub Classroom. The first person to do so should create a new “team”; subsequent team members should join the existing team.
  • Next, everyone should clone the repo into RStudio by creating a new project from version control, pasting in the repo’s URL (ending in .git).
  • One person should create a new master .Rmd file, and commit and push it to GitHub. For consistency, name this file simply project2.Rmd.
  • All other team members should pull to get this file.
  • Each time you sit down to work on the project, pull before you do anything else. This will save you headaches.
  • Whenever you make an edit to any file and want to save it, remember to pull first, then stage and commit. If you’re ready to share it with your group, perform a push.
  • If you get an error upon pulling, it is likely because a file you have edited was also changed by someone else, and Git couldn’t figure out how to reconcile the changes. You may need to go into the file and resolve the conflict manually, then commit to complete the merge.
  • If you get an error upon committing or pushing, you may have forgotten to pull first. If not, you may need to resolve a conflict manually by going into the relevant file(s) and manually editing them to merge the changes.
  • Make sure you have coordinated with your group who is doing what, and when, to minimize the above sorts of problems.

Turning in your project

Collaboration on your project should again take place via GitHub commits. Your .Rmd source should be part of a GitHub repo from its inception, and changes should be recorded via commits from the account of the person who made the edit. Even if you are sitting in the same room together, make sure that commits are made from the person making the edit. Everyone in the group must make at least one commit.

Your final submission will consist of the .Rmd source, compiled .html, and any other files needed for the .Rmd to compile successfully. For example, if you are reading in the data from .csv files stored in your RStudio server account, commit these files. If you are reading data directly from a URL, this is not necessary.

Whatever state those files are in at the deadline is what I will grade.

Grading Rubric

The project is worth 25 points: 20 assigned to the group as a whole and 5 assigned to individuals. The grade is based on the following criteria:

Code grade (7 pts)

  • +1: the .Rmd compiles successfully
  • +1: code, unnecessary messages, and raw R output (other than the plots) are suppressed from the .html output
  • +2: at least one summarize() and at least two other verbs are used
  • +2: at least one join (e.g., left_join(), inner_join(), full_join()) is used
  • +1: copying and pasting of code is avoided in favor of writing functions to handle repeated tasks
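Two of the code criteria above can be sketched as follows: setting knitr chunk options once (typically in a setup chunk near the top of the .Rmd) to keep code, messages, and warnings out of the .html, and writing a small function instead of pasting the same pipeline twice. The helper `summarize_by_year()` and the tables `table_a` and `table_b` are hypothetical names and data used only for illustration.

```r
library(dplyr)

# Set default chunk options once so that code, messages, and warnings
# are hidden in the compiled .html (plots are still shown).
knitr::opts_chunk$set(echo = FALSE, message = FALSE, warning = FALSE)

# Hypothetical helper: the same summary is needed for several tables,
# so it is written once as a function rather than copied and pasted.
summarize_by_year <- function(data) {
  data %>%
    group_by(year) %>%
    summarize(total = sum(value))
}

# Hypothetical toy tables with the same structure.
table_a <- tibble(year = c(2016, 2016, 2017), value = c(1, 2, 4))
table_b <- tibble(year = c(2016, 2017, 2017), value = c(5, 6, 7))

totals_a <- summarize_by_year(table_a)
totals_b <- summarize_by_year(table_b)
```

If the summary later needs to change, it only has to be edited in one place.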

Writeup grade (8 pts)

  • +1: The final post is the right length (500ish words)
  • +1: At least one visualization is included, as generated by embedded code
  • +1: Each graphic is interpreted clearly and concisely in the text, including a “take-home” message in no more than two sentences
  • +1: Data-wrangling methodology is included in the appendix
  • +1: Data-visualization methodology is included in the appendix
  • +0-3: Subjective assessment of insightfulness and quality of the final product

Process grade (5 pts)

  • +1: All group members meet in person as soon as possible to outline the project, identify datasets, list intended visualizations, and start doing some data-wrangling, before the workshop day on Wednesday 3/28. This meeting should be briefly described on the group Slack channel, noting any member absences.
  • +1: The GitHub workflow is followed, and informative commit messages (that describe what changes were made in that commit) are included.
  • +1: A Slack discussion is held as soon after the workshop as possible, to plan how remaining work will be done before the presentation day.
  • +1: The project is at the stage of a “complete draft” by the date of the presentation on 3/30.
  • +1: All group members meet at least one more time in person as soon after the presentations as possible to discuss what revisions are needed, and create a work plan for the final writeup.

Individual grade (5 pts)

  • +1: your individual contributions are clearly documented with commits in GitHub
  • +0-2: you show up at the initial planning meeting, the workshop day, and the final planning meeting, and contribute to Slack discussions and to the presentation
  • +0-2: you do your share of contributing to the final result