STAT 209: Project 2 Description

Set-Up Instructions

Please have exactly one person in your group create the project repo through the GitHub Classroom link here and create a team. Please name your team project2-groupX, where X is your group number.

Then, the other team members should use the same link and join the existing group.

Each person should then create a project in RStudio, initialized using the GitHub URL that you were given when you joined the project on GitHub Classroom.

Goals and New Requirements

The end deliverable of this project is the same as Project 1: Create a “blog-post” style writeup in the style of FiveThirtyEight or the NYTimes “The Upshot” column, centered on a related set of informative, accurate, and aesthetically pleasing data graphics illustrating something about a topic of your choice.

Now that we have more data-wrangling tools in our toolbox, however, we can be more flexible about the kinds of questions we can ask and the kinds of visualizations we can create. Specifically, this project adds the following requirements to those of Project 1:

At least one of your visualizations needs to involve a dataset that is constructed by combining data from two or more data files (that is, it should be created by using a join operation of some kind).
Your visualizations should collectively examine the data at multiple levels of granularity. That is, there should be some examination of variables at both the individual case level and at the level of a summary of multiple cases that share a characteristic (that is, you should use at least one group_by() and summarize() combo).
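As a rough sketch of what these two requirements might look like together (the data frames, column names, and values below are invented purely for illustration and are not from any real source), one data frame is joined to another, then examined both at the case level and at a grouped summary level:

```r
library(dplyr)

# Hypothetical toy data standing in for two separate source files
schools <- tibble(
  school_id = c(1, 2, 3),
  district  = c("North", "North", "South")
)
scores <- tibble(
  school_id = c(1, 1, 2, 3, 3),
  student   = c("a", "b", "c", "d", "e"),
  score     = c(80, 90, 70, 60, 90)
)

# Case level: one row per student, with the district attached via a join
student_level <- scores %>%
  left_join(schools, by = "school_id")

# Summary level: one row per district
district_level <- student_level %>%
  group_by(district) %>%
  summarize(
    n_students = n(),
    mean_score = mean(score)
  )
```

Here student_level could feed a graphic of individual cases, while district_level could feed a graphic of group summaries. Your own project will likely need a different kind of join and different summaries.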

Beyond those requirements, you should use whatever tools are useful for examining the things you want to examine. Likely you will end up wanting to use a number of other data-wrangling tools, but you won’t necessarily need every tool we’ve learned in this class.

There may be things you want to do that aren’t easily done using the tools we’ve specifically covered — if that’s the case, you are welcome and encouraged to look into other tools that suit your purposes, but you should only do this if it would help you tell a more interesting story about the thing you’re studying, and only if you feel up to it.

In general, I will take the ambitiousness of your investigation into account when grading — an investigation that minimally meets the requirements will be held to a higher standard of technical precision than an investigation that goes somewhat beyond them — but please try to stay within the length guidelines.

Required skills

Proficiency with the ggplot2 package
Proficiency with the “five main verbs” of the dplyr package
Proficiency with join operations from the dplyr package
Proficiency with the restructuring verbs pivot_longer and pivot_wider (which supersede gather and spread) from the tidyr package
Proficiency with the GitHub workflow

Relevant background and resources

Chs. 2-5 of the textbook
Core requirements: Slides and labs through Lab 10
Potentially useful: Labs 11-12 and associated content
ggplot2 Quick Reference
RMarkdown Quick Reference
dplyr Quick Reference
tidyr Quick Reference

Structure of the Write-Up (This section is identical to Project 1 but is reproduced for convenience)

Your group will work together to write a blog post about a topic of your choice, with insights informed by data and illustrated via at least three related data graphics.

The following are some examples of the kind of structure I have in mind (though most of these are longer than your post will be).

Conciseness is of value here: aim for a post in the 700-1000 word range (not including the Methodology section at the end). A suggested structure is as follows (though, apart from the inclusion of the Methodology section at the end, you do not have to adhere to this exactly):

Introduction

Sets up the context and introduces the dataset and its source. Tell the reader what the cases in the data are, and what the relevant variables are. However, don’t just list these things: work them into one or more paragraphs that inform the reader about your data as though you were writing an article for a blog.

Headings describing each aspect of the topic you’re focused on

For each graphic you include, write a paragraph or two discussing what the graphic shows, including a concise “takehome” message in one or two sentences. Again, don’t just show graphic, paragraph, graphic, paragraph, …; connect your text and graphics in a coherent narrative.

Discussion

The last section of the main writeup should tie together the insights from the various views of the data you have created, and suggest open questions that were not possible to answer in the scope of this project (either because the relevant data was not available, or because of a technical hurdle that we have not yet learned enough to overcome).

Appendix: Methodology

This should be separate from the main narrative and should explain the technical details of your project for a reader interested in data visualization. Explain the choices you made in your graphics: why did you choose the types of graphs (geometries) that you did, why did you choose the aesthetic mappings you did, why did you choose the color schemes you did, etc.

Turning in your project (Identical to Project 1)

Collaboration on your project should take place via GitHub commits. Your .Rmd source should be part of a GitHub repo from its inception, and changes should be recorded via commits from the account of the person who made the edit. Everyone in the group should be making commits to the repo.

Your final submission will consist of

The .Rmd source
The compiled .html (or .pdf) file
Any other files needed for the .Rmd to compile successfully.

If your data is available on the web, prefer to read it directly from the web in your R code. If you needed to download and “clean up” the data outside of RStudio, and thus need to read it from a .csv file stored locally (that is, in your RStudioPro server account), commit this file if it is relatively small (no more than a few MB in size), and make sure that you are using a relative path to the file when you read in the data. If you have a local data file which is larger than a few MB, you can instead share it via Slack and include instructions in your GitHub README.md file that indicate where it should be placed.
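A minimal sketch of reading a committed data file by relative path is below. The folder name data and the file name my_cleaned_data.csv are placeholders, not names you are required to use, and the sketch assumes the .Rmd sits at the top level of the repo:

```r
# Hypothetical layout: the .Rmd is at the top of the repo and the data file
# lives in a data/ subfolder that is committed alongside it.
data_path <- file.path("data", "my_cleaned_data.csv")

# A relative path like this resolves the same way for every group member
# and for the grader; an absolute path into one person's account would not.
if (file.exists(data_path)) {
  my_data <- read.csv(data_path)
} else {
  message("Expected to find ", data_path, " relative to the .Rmd file")
}

# Data hosted on the web can be read straight from its URL instead
# (placeholder URL, not a real dataset):
# my_data <- read.csv("https://example.com/path/to/data.csv")
```

When the .Rmd is knit, relative paths are resolved from the folder containing the .Rmd by default, which is why keeping the data inside the repo makes the document compile for everyone.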

Whatever state the files in your GitHub repo are in at the deadline is what I will grade.

Data

You can use any data sources you want, but you will need to combine data from at least two sources, so find datasets that have shared variables.

Some possible sources for data are:

The federal government’s Data.gov site
The American Psychological Association
The data science competition Kaggle
The UC Irvine machine learning repository
The Economics Network
Data provided by an R package, such as
- nycflights13: data about flights leaving from the three major NYC airports in 2013
- Lahman: comprehensive historical archive of major league baseball data
- fueleconomy: fuel economy data from the EPA, 1985–2015
- fivethirtyeight: provides access to data sets that drive many articles on FiveThirtyEight

You can find data anywhere else you like. But don’t use a dataset we’ve used in class or homework, and this time, at least some of your data must not come from an R package.

Caution!

Sometimes when a lot of people read in datasets and leave their RStudio sessions open, it can eat up a lot of memory on the server and slow things down. To minimize this issue, please close your RStudio project and sign out from the server (by clicking Sign Out in the upper right, not just closing your browser tab) after each session you spend working on it, so that the memory used by your session can be released.

Tips for `git` and GitHub

Each time you sit down to work on the project, pull before you do anything else. This will save you headaches.
Whenever you make an edit to any file and want to save it, pull first, then stage (add) and commit. If you’re ready to share it with your group, then push.
If you get an error upon pulling, committing, or pushing, it is likely because a file you have edited was also changed by someone else, and git couldn’t figure out how to reconcile the changes. Most of the time this can be prevented by pulling every time you sit down to work, but if not, you may need to go into the file and manually resolve the conflict by finding the conflict markers added by git (look for lines beginning with <<<<<<<, =======, and >>>>>>>) and editing the file to keep what you want from each version, then commit to complete the merge and push. If this happens, notify your group members that you are undertaking a manual merge, so they do not continue to make edits in the meantime!
Make sure you have coordinated who is doing what when with your group, to minimize the above sorts of problems.

Grading Rubric

A suggested division of labor is that each group member is individually responsible for

at least one graphic
the part of the writeup and Methodology section directly pertaining to that graph

and the group as a whole works jointly on

the general Introduction and Discussion
any components of the Methodology section that pertain to the project as a whole.

Your group may choose to divide the work differently, but be sure that each person is involved in

the topic selection and planning stage
the coding component
the “general audience” writing element
the “technical writing” element.

Relevant SLOs

Data Science Workflow

A1: Demonstrate basic fluency with programming fundamentals
A2: Create clean, reproducible reports
- The graphics are generated by the code embedded in the .Rmd (not included from an external file)
- The .Rmd compiles successfully
- Code, unnecessary messages, and raw R output (other than the plots) are suppressed from the .html output
A3: Use a version control system for collaboration and documentation
- There is a GitHub record of commits
- The commit messages are concise and informative
A4: Produce clean, readable code
- Variable names are descriptive
- Line breaks and indentation are used to highlight the structure of the code
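For the A2 item about suppressing code and messages, one common approach (a sketch, not the only way to do it) is to set default chunk options once near the top of the .Rmd using knitr:

```r
# Typically placed in a setup chunk at the top of the .Rmd.
# These are standard knitr chunk options, set here as defaults for every chunk.
knitr::opts_chunk$set(
  echo    = FALSE,  # hide the R code itself
  message = FALSE,  # hide package start-up and other messages
  warning = FALSE   # hide warnings
)
```

Individual chunks can still override these defaults in their own chunk headers. Raw printed output is easiest to avoid by assigning intermediate results to names rather than printing them.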

Understanding Visualization

B1: Identify variables, visual cues, and mappings between them
- The choices of aesthetic mappings and visual elements are motivated well in the Methodology section
B2: Identify key patterns revealed by a visualization
- Concise summaries of each individual visualization are included
B3: Identify strengths and weaknesses of particular visualizations
- The summaries highlight what the visualization shows clearly, what it doesn’t, and some improvements that could be made with additional data or technical skills

Creating Visualizations

C1: Choose appropriate and effective graphical representations
- The visualizations chosen fit together to illustrate interesting features about the data
- The choices made for your visualizations are effective and allow information to be conveyed clearly and efficiently
C2: Employ informative annotation and visual cues to guide the reader
C3: Write clean, readable visualization code
- Pipe syntax is used to promote readability
- Line breaks and indentation are used to highlight the structure of the visualization code
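To illustrate the kind of layout C3 describes, here is a small sketch using the mpg data that ships with ggplot2. It is only a stand-in, since your project needs its own data, and the particular geometry, color, and labels are arbitrary choices for the example:

```r
library(dplyr)
library(ggplot2)

# mpg ships with ggplot2; it is used here only as a stand-in dataset
class_summary <- mpg %>%
  group_by(class) %>%
  summarize(mean_hwy = mean(hwy))

# One step per line: the pipe carries the data, + adds plot layers
hwy_plot <- class_summary %>%
  ggplot(aes(x = reorder(class, mean_hwy), y = mean_hwy)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(
    title = "Mean highway fuel economy by vehicle class",
    x = NULL,
    y = "Mean highway miles per gallon"
  )
```

Putting each wrangling step and each plot layer on its own indented line makes it easy to see, and to comment out, any single step.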

Translating between the qualitative and the quantitative

D1: Choose suitable datasets to address questions of interest
D2: Describe what data suggests in language suitable for a general audience
D3: Extract “takehome messages” across multiple visualizations
- There is a description of “big picture” insights gained from considering the visualizations as a set
- The graphics used collectively convey aspects of the data that would have been difficult to notice with any single view

Data Wrangling

E1: Master “slicing and dicing” data to access needed elements (e.g., with filter(), select(), slice_max(), etc.)
E2: Create new variables from old to serve a purpose (e.g., with mutate(), possibly involving other wrangling or cleaning functions within the definition of the new variables)
E3: Aggregate and summarize data within subgroups (e.g., with group_by() and summarize())
E4: Join data from multiple sources to examine relationships (with join operations such as left_join() or inner_join(), and potentially with pivot_longer() or pivot_wider())
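The following sketch touches several of these verbs in one short pipeline. The rainfall_wide data frame and all of its values are made up for illustration:

```r
library(dplyr)
library(tidyr)

# Hypothetical wide-format data: one row per city, one column per year
rainfall_wide <- tibble(
  city   = c("Ames", "Bend", "Cody"),
  state  = c("IA", "OR", "WY"),
  `2019` = c(30, 12, 10),
  `2020` = c(36, 11, 9)
)

# Restructure to one row per city-year, then create new variables (E2)
rainfall_long <- rainfall_wide %>%
  pivot_longer(
    cols      = c(`2019`, `2020`),
    names_to  = "year",
    values_to = "inches"
  ) %>%
  mutate(
    year = as.integer(year),
    cm   = inches * 2.54
  )

# Slice and dice (E1): keep one year, a few columns, and the top row
wettest_2020 <- rainfall_long %>%
  filter(year == 2020) %>%
  select(city, inches) %>%
  slice_max(inches, n = 1)
```

The long format is usually the one ggplot2 wants, since year can then be mapped to an aesthetic like any other variable.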

Intermediate Data Science Tools (Optional: you might not have a need for these, but if you do use them you can get an extra crack at showing mastery of them)

F1: Modularize repetitive tasks (e.g., by writing your own functions and/or using iteration constructs like lapply() or do())
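As one possible illustration of F1, the sketch below wraps a repeated summary in a function and applies it over several grouping columns with lapply(). The helper name summarize_by, the toy data frame, and its column names are all hypothetical:

```r
library(dplyr)

# Hypothetical helper: the same summary, grouped by whichever column is named
summarize_by <- function(data, group_col) {
  data %>%
    group_by(across(all_of(group_col))) %>%
    summarize(
      n          = n(),
      mean_value = mean(value)
    )
}

# Made-up data for illustration
toy <- tibble(
  region = c("east", "east", "west", "west"),
  type   = c("a", "b", "a", "b"),
  value  = c(1, 3, 5, 7)
)

# Apply the helper once per grouping column instead of copy-pasting the pipeline
group_cols <- c("region", "type")
summaries <- lapply(group_cols, function(col) summarize_by(toy, col))
names(summaries) <- group_cols
```

Each element of summaries is a small data frame, for example summaries$region, that could be passed to its own plot.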