NOTE: As with project 1, please have exactly one person in your group create the project repo through the GitHub Classroom link here, and when asked to create a team, please name your team simply group1, group2, etc. (no capital letters, spaces, or punctuation). Then, the other team members should use the same link and join the existing group. This will do a few things: (1) It gives me easy access to your repos without you explicitly having to invite me, as well as the ability to check out your repos via script without much if any hand-tuning, and (2) It makes your repository private, since it is part of my ocstats organization which has some extra features.
The goal of this project is to leverage your newfound data-wrangling skills to enable you to create data visualizations that you would not have been able to create without the prerequisite wrangling.
ggplot2 package
dplyr package
gather and spread from the tidyr package
ggplot2 Quick Reference
dplyr Quick Reference
tidyr Quick Reference
The final result of the project will be much the same as in Project 1: Your group will work together to write a blog post that contains one or more data graphics that tell the reader something interesting about the domain that the data comes from. The same examples as before apply:
Since you have more tools at your disposal to slice and dice data in creative ways, the writeup can be a little bit longer than before: say up to 1000 words (not counting the Methods section).
The suggested structure of the writeup is much the same as before:
An introduction that sets up the context and introduces the datasets and their sources. Tell the reader what the units of observation (“cases”) are, and what the relevant variables are. Don’t just point the reader to the data via a link: describe it in the text! My impression was that the intros were a bit terse last time (probably because I alluded to “an introductory paragraph”), so I encourage you to give a bit more context this time.
For each graphic you include, a paragraph or two discussing what the graphic shows, including a concise “takehome” message in one or two sentences.
A “methodology” section (separate from the body of the main blog post, but included in the same .Rmd as an “Appendix” section) that explains both the data-wrangling component (how did you get your data in a form such that you could visualize what you wanted to visualize) and the visualization component (why did you choose the type of graph (geometry) you did, why did you choose the aesthetic mapping you did, why did you choose the color scheme you did, etc.). Note: In addition to describing your choices in the methodology section of the writeup, you should include comments in your code explaining what each step does and why.
A discussion section that ties together the insights from the various views of the data you have created, and suggests open questions that were not possible to answer in the scope of this project (either because the relevant data was not available, or because of a technical hurdle that we have not yet learned enough to overcome).
*_join() operation.
filter(), select(), mutate(), arrange(), and summarize() (likely together with group_by()). The use of these verbs should be natural and motivated by the desired visualization, but I envision that you will make use of at least one group_by() and summarize(), as well as at least two other verbs in combination.
gather() and/or spread() operations.
do() or lapply()).
Your writeup should consist of at least as many graphs as there are group members; but there’s a good chance you’ll want to include more than that (however, quality is more important than quantity). You will need to make a judgment call about when it is better to overlay information in one graph and when it is better to put the information in separate graphs.
As before, you can split up the work however you choose, but a natural split might be for each person to be responsible for one or two graphs.
Naturally the graphs must be created within your .Rmd using ggplot2, and you should endeavor to do whatever wrangling is necessary in a separate pipeline, so that the visualization pipeline is as well structured and concise (and readable!) as possible.
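To illustrate what keeping the wrangling and the visualization in separate pipelines might look like, here is a minimal sketch. The data frame city_stats, its columns, and all of its values are made up for illustration; they are not from any course dataset, and your own pipeline will look different.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Hypothetical toy data (made-up values): one row per city per year
city_stats <- data.frame(
  city   = rep(c("A", "B"), each = 3),
  year   = rep(2015:2017, times = 2),
  births = c(10, 12, 11, 20, 19, 23),
  deaths = c(8, 9, 7, 15, 18, 16)
)

# Wrangling pipeline: total each measure by year, then reshape from
# wide (one column per measure) to long (one row per year and measure)
# so that the measure can be mapped to an aesthetic
yearly_totals <- city_stats %>%
  group_by(year) %>%
  summarize(births = sum(births), deaths = sum(deaths)) %>%
  gather(key = "measure", value = "count", births, deaths)

# Visualization pipeline: starts from the wrangled table and
# contains only plotting code
totals_plot <- ggplot(yearly_totals,
                      aes(x = year, y = count, color = measure)) +
  geom_line() +
  geom_point() +
  labs(x = "Year", y = "Total count", color = "Measure")
```

Typing totals_plot on its own line in a code chunk displays the graph. Because the wrangled table has its own name, the ggplot2 call stays short and the methodology section can describe the two stages separately.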
You can use any data source you want, except that you may not use a dataset we have used in a lab, nor can you use a dataset that is included in an R package (at least, not without running it by me first). For full credit, you will need to join data from more than one table. The tables you join need not be from the same source; for example, you could join data about the same geographical entities, or time periods, etc., from different sources.
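As a rough sketch of joining two tables that describe the same geographical entities, consider the following. The two data frames, their column names, and their values are hypothetical, and the choice of join type is only an example; which join is appropriate depends on your data.

```r
library(dplyr)

# Hypothetical table from one source (made-up values)
county_population <- data.frame(
  county     = c("A", "B", "C"),
  population = c(1000, 2500, 400),
  stringsAsFactors = FALSE
)

# Hypothetical table from a different source (made-up values)
county_schools <- data.frame(
  county  = c("A", "B", "D"),
  schools = c(3, 7, 2),
  stringsAsFactors = FALSE
)

# left_join() keeps every row of the first table;
# counties with no match get NA in the new column
kept_all <- left_join(county_population, county_schools, by = "county")

# inner_join() keeps only the counties present in both tables
matched_only <- inner_join(county_population, county_schools, by = "county")

# anti_join() lists the rows of the first table that found no match,
# which is a quick check for mismatched keys between sources
unmatched <- anti_join(county_population, county_schools, by = "county")
```

When the tables come from different sources, the key columns often differ in spelling or capitalization, so inspecting the unmatched rows before plotting can save some confusion.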
Some possible sources for data are (as before):
.git).
.Rmd file, and commit and push all the new files to GitHub. For consistency, name the .Rmd file simply project2.Rmd.
Collaboration on your project should take place via GitHub commits. Your .Rmd source should be part of a GitHub repo from its inception, and changes should be recorded via commits from the account of the person who made the edit. Everyone in the group must make at least one commit.
Your final submission will consist of the .Rmd source, compiled .html, and any other files needed for the .Rmd to compile successfully. For example, if you are reading in the data from a .csv file stored locally (that is, in your RStudioPro server account), commit this file, and make sure that you are using a relative path to the file when you read in the data. If you are reading the dataset directly from an R package or from a URL, this is not necessary.
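A minimal sketch of reading a committed .csv file through a relative path follows. The folder name data and the file name example.csv are hypothetical, and the first few lines create a toy file only so that the sketch runs on its own; in your project the file would already be in the repo.

```r
# Create a toy file so this sketch is self-contained
# (hypothetical folder and file names, made-up values)
dir.create("data", showWarnings = FALSE)
write.csv(
  data.frame(id = 1:3, value = c(2.5, 3.1, 4.8)),
  file.path("data", "example.csv"),
  row.names = FALSE
)

# Relative path: resolved against the working directory, which by
# default is the folder containing the .Rmd when the document is knit
example_data <- read.csv(file.path("data", "example.csv"))

# By contrast, an absolute path such as the hypothetical one below
# only works on the machine where that folder exists:
# read.csv("/home/yourname/project2/data/example.csv")
```

With the relative path, the .Rmd compiles from any clone of the repo as long as the data file is committed at the same location relative to the .Rmd.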
Whatever state those files are in at the deadline is what I will grade.
This project is worth a total of 10% of the course grade. There is a group component and an individual component to the grade, each weighted equally (5% each).
The typical division of labor is that each group member is individually responsible for at least one graphic, along with the part of the writeup and methodology section directly pertaining to that graph, and the group as a whole works together to write and edit the general introduction and conclusion, along with any components of the Methods section that pertain to the project as a whole. Your group may choose to divide the work differently, but be sure that each person is involved in the topic selection and planning stage, the coding component, the “general audience” writing element, and the “technical writing” element.
Wrangling:
Visualization:
.Rmd (not included from an external file)
Markdown:
.Rmd compiles successfully
.html output
Version Control:
Writing:
Communicating Methodology:
Wrangling:
gather() and spread() if it would be useful to do so
Visualization:
Writing:
Code Style:
.Rmd file is well documented, and a consistent coding style is followed (e.g., line breaks and indentation are used in an intentional and purposeful way to improve code readability; a consistent variable naming scheme is followed)
Wrangling:
Writing:
Visualization:
Good Faith:
Wrangling:
Visualization:
ggplot2 code
Writing:
Version Control:
Good Faith:
Wrangling:
Visualization:
Writing:
Wrangling:
Visualization: