Deadline

The official deadline for the final writeup is Monday 8/30 at 4pm, which is the end of the designated “final exam” time for this class.

Goals

This project is similar in scope to the previous one, but unlike the first two projects, it is meant to be done individually. Exceptions can be made for ambitious proposals; talk to me if you have an idea that you think might lend itself better to working with a partner.

There are three basic options:

  • Option 1: Do a new analysis using a dataset that interests you, along the same lines as Project 2, but using a large dataset accessed from a database via SQL (a minimal sketch of this setup follows the list).

  • Option 2: Do a new analysis involving one or more of the advanced or specialized techniques that haven’t been part of a project yet. These include: clustering, dimensionality reduction, geographic data, or text data.

  • Option 3: Some combination of Options 1 and 2
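For Option 1, here is a minimal sketch of what the database setup might look like from R using the DBI package. The host, credentials, and the `flights` table are placeholders, not a real source you have access to; substitute the details for whichever database you actually use.

```r
library(DBI)

# Connect to a (hypothetical) MariaDB/MySQL database -- swap in the
# host, username, and database name for the source you are using
con <- dbConnect(
  RMariaDB::MariaDB(),
  host   = "your-database-host",
  user   = "your-username",
  dbname = "your-database"
)

# Use SQL to pull a manageable subset of a large table into R
delays <- dbGetQuery(con, "
  SELECT origin, dest, dep_delay
  FROM flights
  WHERE year = 2019
")

dbDisconnect(con)
```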

Note: You can pick up where Project 1 or 2 left off if you want to pursue a similar set of topics, or you can pick a different topic entirely. However, if you do continue with a topic you pursued before, coordinate with the people who worked on that project with you, in case they also wish to continue it, so that you take things in sufficiently distinct directions.

The final writeup

The writeup should be structured similarly to previous projects (see the Project 2 description for guidelines).

Data Visualization Requirements

As always, prioritize quality over quantity. A few well-thought-out graphs that take a fair amount of work are much better than several hasty ones.

As always, the graphs must be created within your .Rmd using ggplot2, with preliminary wrangling performed using a combination of dplyr and the other tools we have used in this class (you are welcome to use other tools if there are things you can’t tackle with the content we’ve covered, but try to find a tidyverse solution if possible).
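As a reminder of the expected shape, here is a minimal sketch of a wrangle-then-plot pipeline; `my_data`, `category`, and `value` are hypothetical stand-ins for your own dataset and variables.

```r
library(tidyverse)

# Summarize a measure within groups, then hand the result to ggplot2
my_data %>%
  group_by(category) %>%
  summarize(avg_value = mean(value, na.rm = TRUE)) %>%
  ggplot(aes(x = reorder(category, avg_value), y = avg_value)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Average value")
```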

Wrangling Requirements

As with Project 2, you should employ several of the five basic verbs (or their SQL equivalents), along with joins and/or pivots, and possibly custom functions or iteration as needed, to get your data into a form conducive to the visualizations you want to construct. Your code should be as concise and readable as possible.

You will likely not use every one of these elements, but your wrangling should involve non-trivial manipulation of datasets.
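As an illustration only (the tables and columns here are hypothetical), a pipeline combining several of these elements might look like:

```r
library(tidyverse)

# Hypothetical tables: `measurements` (one row per site/year/metric)
# and `sites` (one row per site, with a region column)
measurements %>%
  filter(year >= 2010) %>%                # keep the years of interest
  left_join(sites, by = "site_id") %>%    # bring in site-level attributes
  group_by(region, metric) %>%
  summarize(avg = mean(value, na.rm = TRUE), .groups = "drop") %>%
  pivot_wider(names_from = metric, values_from = avg)  # one column per metric
```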

Whether you do your wrangling entirely within R or through a mix of R and SQL, you should put your code in RMarkdown code chunks. In general, try to keep your chunks short: each chunk should do just one thing.
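If you mix R and SQL, RMarkdown’s sql chunk engine lets you keep each step in its own short chunk. A sketch (the connection details and the `flights` table are placeholders):

````markdown
```{r connect}
con <- DBI::dbConnect(RMariaDB::MariaDB(), dbname = "your-database")
```

```{sql, connection = con, output.var = "delays"}
SELECT origin, AVG(dep_delay) AS avg_delay
FROM flights
GROUP BY origin
```

```{r plot-delays}
library(ggplot2)
ggplot(delays, aes(x = origin, y = avg_delay)) +
  geom_col()
```
````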

Extra Elements

You should employ one or more of the following elements in your project:

  • SQL
  • Manipulation of text using regular expressions
  • Clustering
  • Dimensionality Reduction
  • Geographic data and spatial layering/projection

The expected depth of mastery of the element(s) you use is inversely proportional to the number of different techniques required to do what you want to do: the fewer techniques you use, the deeper your use of each one should be. Don’t combine techniques that don’t make sense together, but a creative combination of techniques can substitute for technical complexity within a single method.
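For instance, here is a minimal sketch of the regular-expressions element using stringr; `raw_data` and its columns are hypothetical stand-ins for your own dataset.

```r
library(tidyverse)

# Extract a four-digit year from a free-text column, and tidy up
# inconsistent capitalization/whitespace in a city-name column
cleaned <- raw_data %>%
  mutate(
    year = str_extract(description, "\\b(19|20)\\d{2}\\b"),
    city = str_to_title(str_trim(city))
  ) %>%
  filter(!is.na(year))
```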

Data

Any data source that was fair game for Project 2 is fair game here. As before, you can’t use a dataset we’ve used in class or in a lab, or data built into an R package (at least not without running your idea by me).

The GitHub workflow

  • See the Project 2 description for an outline of the recommended GitHub workflow.
  • The GitHub classroom link to create your project repo is here

Turning in your project

Even though you are working by yourself, you should still record the history of edits to your project via GitHub commits.

As always, your final submission will consist of the .Rmd source, the compiled .html, and any other files needed for the .Rmd to compile successfully. Whatever state those files are in at the deadline is what I will grade.

Grading

The final grade for each SLO in the course consists of the simple average of (1) the highest lab or quiz grade for that SLO, (2) the higher of the two group project grades (where applicable), and (3) the grade on this project.

Rubric

Data Science Workflow

  • A1: Demonstrate basic fluency with programming fundamentals
  • A2: Create clean, reproducible reports
    • The graphics are generated by the code embedded in the .Rmd (not included from an external file)
    • The .Rmd compiles successfully
    • Code, unnecessary messages, and raw R output (other than the plots) are suppressed from the .html output
  • A3: Use a version control system for collaboration and documentation
    • There is a GitHub record of commits
    • The commit messages are concise and informative
  • A4: Produce clean, readable code
    • Variable names are descriptive
    • Line breaks and indentation are used to highlight the structure of the code

Understanding Visualization

  • B1: Identify variables, visual cues, and mappings between them
    • The choices of aesthetic mappings and visual elements are well motivated in the Methodology section
  • B2: Identify key patterns revealed by a visualization
    • Concise summaries of each individual visualization are included
  • B3: Identify strengths and weaknesses of particular visualizations
    • The summaries highlight what the visualization shows clearly, what it doesn’t, and improvements that could be made with additional data or technical skills

Creating Visualizations

  • C1: Choose appropriate and effective graphical representations
    • The visualizations chosen fit together to illustrate interesting features about the data
    • The choices made for your visualizations are effective and allow information to be conveyed clearly and efficiently
  • C2: Employ informative annotation and visual cues to guide the reader
  • C3: Write clean, readable visualization code
    • Pipe syntax is used to promote readability
    • Line breaks and indentation are used to highlight the structure of the visualization code

Translating between the qualitative and the quantitative

  • D1: Choose suitable datasets to address questions of interest
  • D2: Describe what data suggests in language suitable for a general audience
  • D3: Extract “take-home messages” across multiple visualizations
    • There is a description of “big picture” insights gained from considering the visualizations as a set
    • The graphics used collectively convey aspects of the data that would have been difficult to notice with any single view

Data Wrangling (You do not have to use all of these)

  • E1: Master “slicing and dicing” data to access needed elements (e.g., with filter(), select(), slice_max(), etc.)
  • E2: Create new variables from old to serve a purpose (e.g., with mutate(), pivot_longer() or pivot_wider(), or similar operations that transform the representation of the data)
  • E3: Aggregate and summarize data within subgroups (e.g., with group_by() and summarize())
  • E4: Join data from multiple sources to examine relationships (with join() operations)

Intermediate Data Science Tools

  • F1: Modularize repetitive tasks (e.g., by writing your own functions and/or using iteration constructs like lapply() or do())
  • F2/F3: Perform basic interactions with a database OR employ at least one specialized statistical or visualization technique