Please have exactly one person in your group create the project repo through the GitHub Classroom link here and create a team. Please name your team simply group1, group2, etc., using the group number of the Project 1 Slack channel you were assigned to.
Then, the other team members should use the same link and join the existing group.
This will do a few things:
* It gives me easy access to your repos without you explicitly having to invite me, as well as the ability to check out your repos via script without much, if any, hand-tuning.
* It makes your repository private, since it is part of the ocstats organization, which has some extra features.
The goal of this project is to create a “blog-post” style writeup in the style of FiveThirtyEight or the NYTimes “The Upshot” column, centered on a related set of informative, accurate, and aesthetically pleasing data graphics illustrating something about a topic of your choice.
Proficiency with RMarkdown, git, and the ggplot2 package
ggplot2 Quick Reference Sheet
Your group will work together to write a blog post about a topic of your choice, with insights informed by data and illustrated via at least three related data graphics.
The following are some examples of the kind of structure I have in mind (though most of these are longer than your post will be).
Conciseness is of value here: aim for a post in the 700-1000 word range (not including the Methodology section at the end). A suggested structure is as follows (though, apart from the inclusion of the Methodology section at the end, you do not have to adhere to this exactly):
Sets up the context and introduces the dataset and its source. Tell the reader what the cases in the data are, and what the relevant variables are. However, don’t just list these things: work them into one or more paragraphs that inform the reader about your data as though you were writing an article for a blog.
For each graphic you include, write a paragraph or two discussing what the graphic shows, including a concise “take-home” message in one or two sentences. Again, don’t just alternate graphic, paragraph, graphic, paragraph, …; connect your text and graphics in a coherent narrative.
The last section of the main writeup should tie together the insights from the various views of the data you have created, and suggest open questions that were not possible to answer in the scope of this project (either because the relevant data was not available, or because of a technical hurdle that we have not yet learned enough to overcome).
The Methodology section should be separate from the main narrative and should explain the technical details of your project for a reader interested in data visualization. Explain the choices you made in your graphics: why you chose the types of graphs (geometries) that you did, why you chose the aesthetic mappings you did, why you chose the color schemes you did, etc.
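As a rough illustration of the three kinds of choices the Methodology section should explain, the sketch below marks where each one appears in ggplot2 code. It uses the built-in mtcars data purely as a stand-in; the variables, the point geometry, and the "Dark2" palette are illustrative assumptions, not recommendations for your project.

```r
library(ggplot2)

# Stand-in example using the built-in mtcars data (not a suggested project dataset).
p <- ggplot(
  mtcars,
  # Aesthetic mappings: which variables drive position and color
  aes(x = wt, y = mpg, color = factor(cyl))
) +
  # Geometry: points, since both position variables are quantitative
  geom_point(size = 2) +
  # Color scheme: a qualitative palette for an unordered grouping variable
  scale_color_brewer(palette = "Dark2") +
  labs(
    x = "Weight (1000 lbs)",
    y = "Miles per gallon",
    color = "Cylinders"
  )

p
```

Each commented line corresponds to one decision you would justify in prose: why this geometry, why these mappings, why this palette.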
Collaboration on your project should take place via GitHub commits. Your .Rmd source should be part of a GitHub repo from its inception, and changes should be recorded via commits from the account of the person who made the edit. Everyone in the group should be making commits to the repo.
Your final submission will consist of
* Your .Rmd source
* The compiled .html (or .pdf) file
* Any other files needed for the .Rmd to compile successfully

If your data is available on the web, prefer to read it directly from the web in your R code. If you needed to download and “clean up” the data outside of RStudio, and thus need to read it from a .csv file stored locally (that is, in your RStudioPro server account), commit this file if it is relatively small (no more than a few MB in size), and make sure that you are using a relative path to the file when you read in the data. If you have a local data file that is larger than a few MB, you can instead share it via Slack and include instructions in your GitHub README.md file that indicate where it should be placed.
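The two ways of reading data described above might look something like the following. This is only a sketch: the URL is a placeholder, and the data folder and my_data.csv file name are hypothetical, so substitute your own.

```r
# Option 1: read directly from the web.
# The URL below is a placeholder, not a real dataset, so the call is commented out.
# my_data <- read.csv("https://example.com/my_data.csv")

# Option 2: read a small file committed to the repo, using a relative path.
# "data" and "my_data.csv" are hypothetical names; the path is relative to the
# directory containing your .Rmd, so it works on any group member's account.
local_path <- file.path("data", "my_data.csv")

if (file.exists(local_path)) {
  my_data <- read.csv(local_path)
}
```

An absolute path (one starting at the root of your own server account) would fail when anyone else tries to compile the .Rmd, which is why the relative form matters.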
Whatever state the files in your GitHub repo are in at the deadline is what I will grade.
You can use any data source you want. For this first project, you are not expected to do much data wrangling, so you should spend minimal (if any) time manipulating the dataset. You might want to do some filter()ing to select subsets, and/or mutate()ing to create new variables, but that’s the extent of the wrangling you should do (and only do that if it’s needed for what you want to show!). If you are finding that creating the graphics you would need to create requires more involved wrangling, you might want to redefine your topic (but keep the original one in mind as a candidate for Project 2!).
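For a sense of scale, the level of wrangling described here amounts to something like the sketch below. It assumes the dplyr package and uses the built-in mtcars data only as a stand-in; the subset and the derived variable are arbitrary choices for illustration.

```r
library(dplyr)

# Stand-in example using the built-in mtcars data.
four_cyl <- mtcars %>%
  # filter(): keep only the rows of interest (here, four-cylinder cars)
  filter(cyl == 4) %>%
  # mutate(): add a derived variable (weight in pounds; wt is in 1000 lbs)
  mutate(wt_lbs = wt * 1000)

head(four_cyl)
```

If your plan needs much more than a pipeline of this size, that is a sign the topic may be a better fit for a later project.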
Some possible sources for data are:
* nycflights13: data about flights leaving from the three major NYC airports in 2013
* Lahman: comprehensive historical archive of major league baseball data
* fueleconomy: fuel economy data from the EPA, 1985–2015
* fivethirtyeight: provides access to data sets that drive many articles on FiveThirtyEight

You can find data anywhere else you like. But don’t use a dataset we’ve used in class or homework, and if you are using a dataset from an R package, ensure that you’re doing something different from what might be in the examples given in the documentation on the dataset.
To see a list of the datasets provided by a given R package, you can type the following at the console (fill in the package name).
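One way to do this, assuming base R's data() function is what is intended, is sketched below. The built-in datasets package stands in for the package name; replace it with the package you are interested in (which must be installed).

```r
# List the datasets bundled with a package.
# "datasets" is a stand-in; fill in the package name you want to explore.
available <- data(package = "datasets")

# The result stores one row per dataset; show the names and short descriptions.
head(available$results[, c("Item", "Title")])
```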
Sometimes when a lot of people are reading in datasets and leave their RStudio sessions open, it can eat up a lot of memory on the server and slow things down. To minimize this issue, please close your RStudio project and sign out from the server (by clicking Sign Out in the upper right, not just closing your browser tab) after each session you spend working on it, so that the memory used by your session can be released.
A suggested division of labor is that each group member is individually responsible for
and the group as a whole works jointly on
Your group may choose to divide the work differently, but be sure that each person is involved in
* .Rmd (not included from an external file)
* .Rmd compiles successfully
* .html output