This project builds on the previous one in scope. Unlike the first two projects, this one is meant to be done individually. I may make exceptions for unusually ambitious proposals; talk to me if you have an idea that you think might lend itself better to working with a partner.
There are four basic options:
Option 1: Do a new analysis using a dataset that interests you, along the same lines as project 2, but using a large dataset, accessed from a database via SQL.
Option 2: Do a new analysis involving one or more of the advanced or specialized techniques that haven’t been part of a project yet. These include: interactive graphics, clustering, dimensionality reduction, geographic data, text data.
Option 3: Some mix of 1 and 2
Option 4: “Improve and extend” one of your first two projects. The “improve” part means revising elements of the original project that you weren’t entirely satisfied with the first time. The “extend” part means taking advantage of tools that you learned after you did the original project to look at the data from a different perspective, complement the analysis with additional data, apply some of the machine learning algorithms you will have learned by that point, etc. The degree of extension expected is inversely related to the scope of the original project: a revision of project 1 should be more different from the original than a revision of project 2. If more than one person from the same group decides to extend a project that they worked on together, they should take it in distinct directions.
Note: since revising an existing project means starting with a base of work that was already evaluated (and which was not entirely your own), the expectations for quality are higher in this case than for a new project starting from scratch.
The writeup should be structured similarly to previous projects (see the project 2 description for guidelines)
As always, prioritize quality over quantity. A few really well thought out graphs that take a fair amount of work is much better than several hasty ones.
As always, the graphs must be created within your .Rmd
using ggplot2
, and you should endeavor to do whatever wrangling is necessary in a separate pipeline so that the visualization code is as well structured and concise as possible.
As with Project 2, you should employ a number of the five basic verbs (or their SQL equivalents), along with joins, gather, spread, and custom functions/iteration as needed to get your data into a form conducive to the visualizations you want to construct.
You will likely not use every one of these elements, but your wrangling should involve non-trivial manipulation of datasets.
Whether you doing your wrangling entirely within R
or through a mix of R
and SQL
, you should put your code in RMarkdown code chunks, and keep your wrangling pipeline(s) separate from your visualization pipeline(s), for readability’s sake.
You should employ one or more of the following elements in your project:
Any data source that was fair game for Project 2 is fair game here: you can’t use a dataset we’ve used in class or a lab, nor data built into an R package (at least not without running your idea by me).
Even though you are working by yourself, you should still record the history of edits to your project via GitHub commits.
As always your final submission will consist of the .Rmd
source, compiled .html
(or .pdf
if you prefer), and any other files needed for the .Rmd
to compile successfully. Whatever state those files are in at the deadline is what I will grade.
This project makes up 25% of the final grade for the course. In addition, your overall project grade (for projects 1-3, totalling 45% of the course grade) cannot be below your grade for this project.
Since the specs for this project are more open, the grading rubric is not as concrete this time, but here is a rough rubric:
Markdown and Version Control (10%):
.Rmd
compiles successfully with no error messages.html
outputWrangling (20%):
Visualization (20%):
Coding Style (10%):
Basic Structure (15%):
Polish (10%):
Because there are many different options for the “extra element(s)” in the project, with varying degrees of technical difficulty associated with them, I cannot list detailed criteria for each possibility.
Therefore, I will rate the ambitiousness (A) of the project on a 0-4 scale, and the overall technical accuracy of the project (T), on a 0-6 scale.
The score for this component will be calculated as min(15, A*T)
.
In other words, to get full credit, the product of your ambitiousness and technical accuracy should be at least 15 (for example, your project could be 3/4 on the ambitiousness scale and 5/6 on the correctness scale, or 4/4 on the ambitiousness scale and 4/6 on the correctness scale).
I will be impressed if your project is both extremely ambitious and flawless, but your grade won’t exceed 15; that is, technical flourish isn’t a substitute for a good writeup, good coding practice, etc.