The structure and spirit of this project are adapted from the SDS 192 “Intro to Data Science” course at Smith College, by Ben Baumer and Jordan Crouser.

Goals

The goal of this project is to create informative, accurate, and aesthetically pleasing data graphics.

Required skills

Proficiency with the ggplot2 package

Relevant background and resources

• Chs. 2 and 3 of the textbook
• DataCamp courses on ggplot2
• Slides and labs from 2/7 through 2/21
• ggplot2 Quick Reference Sheet
• RMarkdown Quick Reference

Instructions

Your group will work together to write a blog post that contains one or more data graphics that tell the reader something interesting about the domain that the data comes from. The following are some examples of the kind of structure I have in mind (though most of these are longer than your post will be).

Conciseness is of value here: aim for a post in the 200-400 word range. A suggested structure is as follows (though you do not have to adhere to this exactly):

• An introductory paragraph that sets up the context and introduces the dataset and its source. Tell the reader what the units of observation (“cases”) are, and what the relevant variables are.

• For each graphic you include, a paragraph or two discussing what the graphic shows, including a concise “take-home” message in one or two sentences.

• A “methodology” section (separate from the main blog post) that explains the choices you made in your graphic: why you chose the type of graph (geometry), the aesthetic mapping, and the color scheme that you did.
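
To make the three choices named in the methodology bullet concrete, here is a minimal sketch rather than a template to copy. It assumes the mpg dataset that ships with ggplot2; the variables, palette, and labels are illustrative choices only.

```r
library(ggplot2)

# mpg ships with ggplot2; each case is one car model
p <- ggplot(mpg, aes(x = displ, y = hwy, color = class)) +  # aesthetic mapping
  geom_point(alpha = 0.7) +                                  # geometry: scatterplot
  scale_color_brewer(palette = "Dark2") +                    # color scheme
  labs(
    title = "Larger engines tend to get fewer highway miles per gallon",
    x = "Engine displacement (liters)",
    y = "Highway fuel economy (mpg)",
    color = "Vehicle class"
  )

p
```

Each commented line corresponds to one decision a methodology section would justify: why points rather than another geometry, why these variables are mapped to position and color, and why a qualitative palette suits a categorical variable.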

Turning in your project

Collaboration on your project should take place via GitHub commits. Your .Rmd source should be part of a GitHub repo from its inception, and changes should be recorded via commits from the account of the person who made the edit. Everyone in the group must make at least one commit.

Your final submission will consist of the .Rmd source, compiled .html, and any other files needed for the .Rmd to compile successfully. For example, if you are reading in the data from a .csv file stored in your RStudio server account, commit this file. If you are reading the dataset directly from an R package or from a URL, this is not necessary.

Whatever state those files are in at the deadline is what I will grade.
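
As a rough illustration of the .csv case described above, the sketch below writes a small file into a temporary folder and reads it back. The folder and the file name cars.csv are hypothetical stand-ins for your own repo and data file.

```r
# Hypothetical setup so the sketch runs on its own: a temporary folder
# stands in for the project repo, and cars.csv for the committed data file.
repo_dir <- tempfile("project_repo")
dir.create(repo_dir)
write.csv(head(mtcars), file.path(repo_dir, "cars.csv"), row.names = FALSE)

# Read the file back. In a real .Rmd that sits next to the committed file,
# this would simply be read.csv("cars.csv").
cars <- read.csv(file.path(repo_dir, "cars.csv"))
str(cars)
```

Referring to the file by a path relative to the .Rmd, rather than an absolute path into one person's account, is what lets the document compile for everyone who clones the repo.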

Data

You can use any data source you want. For this first project, you are not expected to do any data wrangling, so you should spend minimal (if any) time manipulating the dataset. You might want to do some filter()ing to select subsets, and/or mutate()ing to create new variables, but that’s the extent of the wrangling you should do (and only do that if it’s appropriate for what you want to show!)
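
The level of wrangling described above might look something like the following sketch. It assumes the dplyr package and uses the built-in mtcars data; the new variable name kpl is made up for illustration.

```r
library(dplyr)

# mtcars ships with R; each case is one car model
manual_cars <- mtcars %>%
  filter(am == 1) %>%          # keep only the manual-transmission cars
  mutate(kpl = mpg * 0.4251)   # new variable: approximate kilometers per liter

head(manual_cars)
```

One filter() and one mutate() is about as far as the wrangling should go for this project.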

Some possible sources for data are:

• The federal government’s Data.gov site
• The American Psychological Association
• The data science competition Kaggle
• The UC Irvine machine learning repository
• The Economics Network
• Data provided by an R package, such as:
  • nycflights13: data about flights leaving from the three major NYC airports in 2013
  • babynames: names given to babies in the US each year, from the Social Security Administration
  • Lahman: comprehensive historical archive of major league baseball data
  • fueleconomy: fuel economy data from the EPA, 1985–2015
  • fivethirtyeight: provides access to data sets that drive many articles on FiveThirtyEight
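
As one possible way to check for and explore a package like those listed, consider the sketch below. It uses the datasets package bundled with R as a stand-in so that it runs without downloading anything; swap in a name such as "nycflights13" to try it on one of the packages above.

```r
# Assumption: "datasets" (bundled with R) stands in for a package from the
# list above, so the sketch runs without any downloads.
pkg <- "datasets"

# Install the package only if it is missing (requires an internet connection)
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)
}

# The $results matrix has one row per dataset, with its name and a short title
available <- data(package = pkg)$results
head(available[, c("Item", "Title")])
```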

To see a list of the datasets provided by a given R package, you can type the following at the console (fill in the package name):

data(package = "packagename")

Some of the above packages are not installed, so you will first need to install them with install.packages("packagename").

Grading

The project is worth 20 points: 17 assigned to the group as a whole and 3 to each individual. The grade is based on the following criteria:

Group grade: Basic (10 pts)

• +1: the .Rmd compiles successfully
• +1: a description of the dataset is provided
• +1: a graphic is included
• +2: the graphic is generated by the code embedded in the .Rmd (not included from an external file)
• +1: the visual (aesthetic) mapping is described in the text
• +1: the graphic includes relevant context (title, axis labels, etc.)
• +1: the blog post is not too long and not too short (roughly 200-400 words)
• +2: the graphic is interpreted clearly and concisely, including a “take-home” message in no more than two sentences

Group grade: Finishing touches (7 pts)

• +1: code, unnecessary messages, and raw R output (other than the plots) are suppressed from the .html output
• +1: the choices made are effective and allow information to be conveyed clearly and efficiently
• +2: the visualization choices are described in a “Methods” paragraph
• +0-3: subjective assessment of the overall quality and polish of your post

Individual grade (3 pts)

• +1: your individual contributions are clearly documented with commits in GitHub
• +0-2: you carried your weight in contributing to the project
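
For the finishing-touches criterion about keeping code and messages out of the .html output, one common approach is to set knitr chunk options once in a setup chunk near the top of the .Rmd. This is a sketch of that approach, assuming the knitr package that RMarkdown uses to compile the document; it is not the only way to meet the criterion.

```r
# Assumed to sit in a setup chunk near the top of the .Rmd
knitr::opts_chunk$set(
  echo = FALSE,     # hide the R code itself
  message = FALSE,  # hide package start-up and other messages
  warning = FALSE   # hide warnings
)
```

Plots still appear with these settings. If a particular chunk also prints raw text output, adding results = "hide" to that chunk's header suppresses the text while leaving its plots in place.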