The official deadline for the final writeup is Monday 8/30 at 4pm, which is the end of the designated “final exam” time for this class.
This project builds on the previous one in scope. Unlike the first two projects, this one is meant to be done individually. Exceptions can be made for ambitious proposals; talk to me if you have an idea that you think might lend itself better to working with a partner.
There are three basic options:
Option 1: Do a new analysis using a dataset that interests you, along the same lines as project 2, but using a large dataset, accessed from a database via SQL.
Option 2: Do a new analysis involving one or more of the advanced or specialized techniques that haven’t been part of a project yet. These include: clustering, dimensionality reduction, geographic data, or text data.
Option 3: Some mix of 1 and 2
Note: You can pick up from where Project 1 or 2 left off if you want to pursue a similar set of topics, or you can pick a different topic entirely. However, if you do continue with one of the topics you pursued before, make sure you coordinate with the other people who worked on those projects with you, in case they also wish to do that, to make sure you are taking things in sufficiently distinct directions.
The writeup should be structured similarly to previous projects (see the project 2 description for guidelines).
As always, prioritize quality over quantity. A few really well thought out graphs that take a fair amount of work are much better than several hasty ones.
As always, the graphs must be created within your .Rmd using ggplot2, with preliminary wrangling performed using a combination of dplyr and other tools we have used in this class (you are welcome to make use of other tools if there are things you aren’t able to tackle with content we’ve covered, but try to find a tidyverse solution if possible).
As with Project 2, you should employ a number of the five basic verbs (or their SQL equivalents), along with joins and/or pivots, and possibly custom functions/iteration as needed, to get your data into a form conducive to the visualizations you want to construct. Your code should be as concise and readable as possible.
You will likely not use every one of these elements, but your wrangling should involve non-trivial manipulation of datasets.
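To give a rough sense of what "non-trivial manipulation" might look like, here is a minimal sketch that chains a filter, a join, a grouped summary, and a pivot. The tables `sales` and `stores` and all of their columns are hypothetical toy data invented for illustration only; your own project needs a real dataset and will likely call for a different combination of steps.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy tables, invented purely for illustration
sales <- tibble(
  store_id = c(1, 1, 2, 2, 3),
  year     = c(2020, 2021, 2020, 2021, 2021),
  revenue  = c(100, 120, 80, 95, 60)
)
stores <- tibble(
  store_id = c(1, 2, 3),
  region   = c("East", "East", "West")
)

revenue_by_region <- sales %>%
  filter(year >= 2020) %>%                      # keep only the rows of interest
  left_join(stores, by = "store_id") %>%        # attach each store's region
  group_by(region, year) %>%                    # one group per region-year pair
  summarize(total_revenue = sum(revenue), .groups = "drop") %>%
  pivot_wider(                                  # one column per year
    names_from   = year,
    values_from  = total_revenue,
    names_prefix = "revenue_"
  )

revenue_by_region
```

Each line of the pipeline does one job, which keeps the code readable even as the number of steps grows.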
Whether you do your wrangling entirely within R or through a mix of R and SQL, you should put your code in RMarkdown code chunks. In general, try to keep your chunks short. Each chunk should generally do just one thing.
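As one possible illustration of mixing R and SQL, the sketch below runs a query from R through the DBI package and gets an ordinary data frame back, ready for further wrangling or plotting. It assumes the DBI and RSQLite packages are installed. The in-memory SQLite database and the `trips` table are hypothetical stand-ins for whatever real database you connect to. In the query, WHERE plays the role of filter(), GROUP BY with aggregate functions plays the role of group_by() and summarize(), and ORDER BY plays the role of arrange().

```r
library(DBI)

# Assumption: an in-memory SQLite database stands in for a real remote database
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical toy table, invented purely for illustration
trips <- data.frame(
  city    = c("A", "A", "B", "B", "B"),
  minutes = c(10, 20, 30, 40, 5)
)
dbWriteTable(con, "trips", trips)

# Do the filtering and summarizing in SQL, then bring the small result into R
long_trips_by_city <- dbGetQuery(con, "
  SELECT city,
         COUNT(*)     AS n_trips,
         AVG(minutes) AS mean_minutes
  FROM trips
  WHERE minutes >= 10
  GROUP BY city
  ORDER BY city
")

dbDisconnect(con)

long_trips_by_city
```

Doing the heavy reduction in SQL and only pulling the summarized result into R is one reasonable division of labor when the underlying table is large.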
You should employ one or more of the following elements in your project:
The expected depth of mastery of the element(s) used is inversely proportional to the number of different techniques that are required to do what you want to do. Don’t combine techniques that don’t make sense together, but some creative combination of techniques can substitute for technical complexity within a method.
Any data source that was fair game for Project 2 is fair game here: you can’t use a dataset we’ve used in class or a lab, nor data built into an R package (at least not without running your idea by me).
Even though you are working by yourself, you should still record the history of edits to your project via GitHub commits.
As always, your final submission will consist of the .Rmd source, the compiled .html, and any other files needed for the .Rmd to compile successfully. Whatever state those files are in at the deadline is what I will grade.
The final grade for each SLO in the course consists of the simple average of (1) the highest lab or quiz grade for that SLO, (2) the higher of the two group project grades (where applicable), and (3) the grade on this project.
[…] .Rmd (not included from an external file)
.Rmd compiles successfully
.html output […]
[…] (filter(), select(), slice_max(), etc.)
[…] (mutate(), pivot_longer() or pivot_wider(), or similar operations that transform the representation of the data)
[…] (group_by() and summarize())
[…] (join() operations)
[…] (lapply() or do())