Partnership Roles

  • One of you (the person in column “B” in the table) will be the “reader”. Your job is to have this document open and read it aloud.
  • The other (in column “A”) is the “scribe”. You are the one who, for this assignment, does the typing. The scribe can ask the reader to stop, and can ask clarification questions, but should type only what the reader tells them to type.
  • In places where you need to write your own code, discuss it together, but then the reader should make the final call as to what gets typed.

This should help you slow down, read carefully, pay attention to detail, and make sure you understand what you’re reading and writing.

Scribe Only: Log in to the Oberlin RStudioPro Server and Create a Project and Markdown File

  • Log in at rstudiopro.oberlin.edu.
  • Click the Project dropdown menu in the top right of the RStudio window and select New Project. Call the project lab3 and make a directory for it inside your stat113 folder.
  • Open the lab3-template.Rmd file (back in the stat113 folder) and immediately save it to stat113/lab3 as lab3.Rmd. Read the Markdown instructions in the file before moving on.

Attribution

Parts of this lab were adapted by Colin Dawson from source material by Andrew Bray, Mine Çetinkaya-Rundel, and the UCLA statistics department that accompanies the OpenIntro statistics textbooks. This handout, like the source material, is covered by a Creative Commons Attribution-ShareAlike 3.0 Unported license.

Summary of this Lab

The movie Moneyball focuses on the “quest for the secret of success in baseball”. It follows a low-budget team, the Oakland Athletics, who believed that underused statistics, such as a player’s ability to get on base, better predict the ability to score runs than typical statistics like home runs, RBIs (runs batted in), and batting average. Obtaining players who excelled in these underused statistics turned out to be much more affordable for the team.

The goal of this lab is to explore various simple linear regression models to predict the number of runs scored by baseball teams in a season, using a variety of common team level measures of a team’s offensive performance.

What to Turn In

Turn in two files: your lab3.Rmd R Markdown source file and the knitted lab3.html file. Both of these should go in the ~/stat113/turnin/lab3 folder in the scribe’s account (if you finish separately, turn in your own copies and clearly indicate where you started working separately).

Using Hitting Stats to Predict MLB Team Performance

The Data

We will use a dataset from the 2011 Major League Baseball season. In addition to runs scored (Runs), there are seven traditionally used variables in the data set: AtBats, Hits, HomeRuns, BattingAvg, Strikeouts, StolenBases, Wins.

There are also three newer variables: on base percentage (OBP), slugging percentage (SLG), and on-base plus slugging (OPS). For the first portion of the analysis we’ll consider the seven traditional variables. At the end of the lab, you’ll work with the newer variables on your own.

The data is located at http://colindawson.net/data/mlb11.csv.

First we need to read the data in with read.file() as MLB11:
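For example, a chunk along these lines will do it (read.file() comes from the mosaic package, which the lab template is assumed to load; the URL is the one given above):

```r
# Load the mosaic package (if the template has not already done so)
# and read the 2011 MLB data into an object called MLB11.
library(mosaic)
MLB11 <- read.file("http://colindawson.net/data/mlb11.csv")
```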

  1. What type of plot would you use to display the relationship between Runs and one of the other quantitative variables? Plot this relationship using the variable AtBats on the x-axis (you might need to consult previous labs/handouts for a reminder of the function name and syntax). Looking at your plot, describe the relationship between these two variables. Make sure to discuss the form, direction, and strength of the relationship as well as any unusual observations. Does the relationship look linear? If so, quantify the strength of the relationship by computing the correlation coefficient.

Finding the Best-Fit Line

We have said that the magnitude of the correlation coefficient tells us how tightly the data fits a straight line, and the sign tells us whether the slope of that line is positive or negative. But how do we know what the “best” line is?

Definition: Residuals

For any line we might draw through the data, we can ask for each data point (a) whether it is above or below the line, and (b) how close it is to the line in the vertical direction. If we measure this vertical distance and assign a positive value if the case is above the line and a negative value if the case is below the line, we have what is known as the residual for that case.

The “line of best fit” is defined as the line that makes these residuals collectively as small in magnitude as possible (technically, it minimizes the sum of squared residuals).

Visualizing the residuals

Type the following at the console (not in your Markdown document) to produce an interactive plot that will let you draw your own regression line and then will show you the residuals associated with it.
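The call should look something like the sketch below (plot_ss() is provided by the course materials; a function with the same name and arguments also appears in the statsr package that accompanies the OpenIntro labs):

```r
# Interactive plot: click two points in the plot window to define a line;
# the chosen line and its residuals will then be drawn.
plot_ss(x = AtBats, y = Runs, data = MLB11)
```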

Make sure you have run the chunk above that reads in the data first.

If you ran the above in the console, you should see a plot in the Plots tab in the lower-right pane. You may need to expand or “pop out” (Zoom) the plot to see the whole graph.

After running this command, you’ll be prompted to click two points on the plot to define a line. Once you’ve done that, the line you specified will be shown in black and the residuals in blue. Note that there are 30 residuals, one for each of the 30 observations. Recall that the residuals are the (signed) vertical difference between the observed values of the “y” variable and the “y” values predicted by the line.

The most common way to do linear regression is to select the line that minimizes the sum of squared residuals. To visualize the squared residuals, you can rerun the plot command and add the argument showSquares = TRUE to the plot_ss() function.
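For example (with the same assumptions about plot_ss() as above):

```r
# Same interactive plot, but with the squared residuals drawn as squares.
plot_ss(x = AtBats, y = Runs, data = MLB11, showSquares = TRUE)
```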

Note that the sum of squared residuals is displayed in the console when you choose your line.

  2. Re-run the line above a few times (at the console), selecting different lines, and see how small you can get the sum of squared residuals (SSR) to be.
    Write down the equation for your best line (in the form \(\hat{y} = \hat{a} + \hat{b} x\)). (The “hats” signify values estimated from data. Don’t worry about getting them to show up in Markdown; just fill in variable names for \(\hat{y}\) and \(x\), and fill in numbers for \(\hat{a}\) and \(\hat{b}\))

It is rather cumbersome to try to get the correct least squares line, i.e. the line that minimizes the sum of squared residuals, through trial and error. The best line can be obtained analytically with calculus, but we can just use the lm() function in R to give us the intercept and slope of the best-fitting line.

We will want to use the resulting linear model later, so we’ll save the result to a named R object.
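A chunk along the following lines fits the model and stores it (the object name AtBats_model is just a placeholder; any valid name works):

```r
# Fit the least-squares line predicting Runs from AtBats and save the result.
AtBats_model <- lm(Runs ~ AtBats, data = MLB11)
```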

You can display the model coefficients (that is, intercept and slope) by calling the coef() function on the model object. The intercept will be labeled as (Intercept); the slope will be under the name of the explanatory variable (in this case, AtBats).
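For example, using the model object saved above:

```r
# Display the estimated intercept and slope.
coef(AtBats_model)
```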

  3. Write down the prediction equation for the best fit line (in the form \(\hat{y} = \hat{a} + \hat{b} x\)) found by lm(). Did you get close with your trial-and-error approach?

Plotting data with the best fit line

We can add the line of best fit to our scatterplots by passing the scatterplot on to the gf_lm() function as follows:
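One way to do this, assuming the ggformula functions used in earlier labs, is:

```r
# Scatterplot of Runs vs. AtBats with the least-squares line overlaid.
gf_point(Runs ~ AtBats, data = MLB11) %>%
  gf_lm()
```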

  4. If a team manager saw the least squares regression line and not the actual data, how many runs would they predict for a team with 5,578 at-bats? Estimate this visually first, then find the precise value using the equation for the line.

From now on we will use lm() instead of plot_ss() to find the best line directly. (The plot_ss() function was just there to illustrate how different lines give different sums of squared residuals.)

  5. Fit a new model that uses HomeRuns to predict Runs. Using the estimates from the R output, write the equation of the regression line. What does the slope tell us in the context of the relationship between success of a team and its home runs?

Measuring fit with \(R^2\)

Definition: \(R^2\)

One measure of how poorly the model fits the data is the sum of the squared residuals. A related measure that tells us how well the model fits is called the coefficient of determination, denoted by \(R^2\).

This is a number between 0 and 1 that indicates what proportion of the total variability in the response variable is “predictable” using a linear model, if we know the value of the explanatory variable.

One way to think about this is to imagine that we didn’t use the explanatory variable at all, and instead just drew a horizontal line through the data at the mean of the response variable. If we did this, the residuals would tend to be larger than if we used the best-fitting line.

If we divide the sum of squared residuals using the best fit line (\(SSR_{model}\)) by the sum of squared residuals using a baseline horizontal line at the “y” mean (\(SSR_{baseline}\)), we’ll get a number between 0 and 1. This number represents the “unpredictable” variability in the response variable.

The rest of the variability (that is, one minus this number) is predictable. We call this quantity \(R^2\).

\[ R^2 = 1 - \frac{SSR_{model}}{SSR_{baseline}} \]

It turns out that this is also equal to the square of the correlation between the variables; hence the notation.
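As a quick check, you can square the correlation (mosaic’s formula interface for cor() is assumed here) and compare it to the \(R^2\) value computed below:

```r
# Squaring the correlation between Runs and AtBats gives the same value
# as the model's R-squared.
cor(Runs ~ AtBats, data = MLB11)^2
```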

The \(R^2\) value of a model can be found in R as follows:
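For example, for the AtBats model saved above (rsquared() is from the mosaic package; the summary() route works in base R as well):

```r
# Coefficient of determination for the AtBats model.
rsquared(AtBats_model)

# Equivalent, using base R:
summary(AtBats_model)$r.squared
```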

  6. Compare the \(R^2\) value for the AtBats model to the \(R^2\) value for the HomeRuns model. Which explanatory variable does a better job of predicting the number of runs scored?

Assessing Model Quality

Not every linear model is appropriate, even if the residuals are small. We should check at least two things:

  • Linearity: Is there a “leftover” pattern in the residuals which is associated with the explanatory variable or with the predicted values? If so, the relationship is likely not linear.
  • Approximate “Normality”: Is a histogram of just the residuals approximately bell-shaped (“Normally distributed”)? If not, the best fit line may not be reliable, due to skew or outliers.

To check for linearity, we should plot the residuals against either the explanatory variable or the fitted “y” values (the values right on the line):
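For the AtBats model, plots along these lines (using ggformula, with resid() and fitted() extracting the residuals and fitted values) will work:

```r
# Residuals vs. the explanatory variable...
gf_point(resid(AtBats_model) ~ AtBats, data = MLB11)

# ...and residuals vs. the fitted values.
gf_point(resid(AtBats_model) ~ fitted(AtBats_model))
```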

Do you see any pattern?

To check for Normality (bell-shapedness) we can create a density histogram (with smooth curve overlaid), which we can compare to a bell curve.
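For example (again for the AtBats model; the bin count is an arbitrary choice):

```r
# Density histogram of the residuals with a smooth density curve overlaid.
gf_dhistogram(~ resid(AtBats_model), bins = 10) %>%
  gf_dens()
```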

There seems to be a bit of right skew to the distribution of the residuals.

Quantile-Quantile (QQ) Plots

An alternative plot we can use to assess Normality is called a Quantile-Quantile plot, or a QQ Plot.

It plots the quantiles (percentiles, but on a 0-to-1 proportion scale) of a theoretical bell-curve distribution (remember the 68%/95% rule) against the actual quantiles of the residuals.

If the residuals are approximately Normal, the points should fall close to a straight line, since the two sets of quantiles will match up. If the points curve sharply or form an \(S\)-shape, that is a sign that the residual distribution is skewed, or has values that are more extreme than expected.
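A QQ plot of the residuals from the AtBats model can be produced along these lines (gf_qqline() adds a reference line to compare against):

```r
# QQ plot of the residuals against a theoretical Normal distribution.
gf_qq(~ resid(AtBats_model)) %>%
  gf_qqline()
```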

  7. Using the HomeRuns model you created above, produce the following plots: (1) A scatterplot with the residuals on the y axis and the explanatory variable on the x axis; (2) A scatterplot with the residuals on the y axis and the fitted values on the x axis; (3) a histogram of the residuals; and (4) a QQ plot of the residuals against a theoretical bell curve. Does the fit look roughly linear? Are the residuals roughly Normal?

  8. Choose another “traditional” variable from MLB11 that you think might be a good predictor of Runs. Produce a scatterplot of the two variables and fit a linear model. At a glance, does there seem to be a linear relationship?

  9. How does this relationship compare to the relationship between Runs and AtBats? Use the \(R^2\) values from the two model summaries to compare. Does your variable seem to predict Runs better than AtBats? How can you tell?

  10. Now that you can summarize the linear relationship between two variables, investigate the relationships between Runs and each of the other five “traditional” variables. Which variable best predicts Runs? Support your conclusion using the graphical and numerical methods we’ve discussed (for the sake of conciseness, only include output for the best variable, not all five).

  11. Now examine the three “newer” variables: on-base percentage (OBP), slugging percentage (SLG) and on-base-plus-slugging (OPS). These are the statistics used by the author of Moneyball to predict a team’s success. In general, are they more or less effective at predicting runs than the traditional variables? Explain using appropriate graphical and numerical evidence. Of all ten variables you’ve analyzed, which seems to be the best predictor of runs? Does the model using that variable satisfy the conditions of linearity and near-normality?

Examining the effect of outliers

The dataset depression.csv (available at http://colindawson.net/data/depression.csv) contains measurements of three quantitative variables for each of 15 patients hospitalized for depression. The variable Depression is a standard measure of depression, adjusted to control for some factors known to be correlated with depression. The variable Simplicity is a measure of patients’ need to see the world in absolutes, or “black and white”.

  12. Read in the data as an object called Depression using read.file() and produce a scatterplot (using gf_point()) relating these two variables, showing the best fit regression line using gf_lm(). Describe the association you see, and visually estimate the correlation from the scatterplot.

  13. Visually estimate the intercept and slope of the regression line from the plot and write the prediction equation in the form \(\hat{y} = \hat{a} + \hat{b} x\).
    Use the prediction equation to predict the Depression score for someone with a Simplicity score of 2.0. If the prediction given by your equation does not match what you see on the plot, revise your equation.

  14. Now calculate the actual correlation and regression coefficients for the depression data. Did you get close?

  15. On the scatterplot, do any points seem unusual?

We can exclude the individual with a Simplicity score over 2.0 (i.e., keep only those with a Simplicity score at or below 2.0) by using the filter() function as follows (this assumes you have done Exercise 12 already to read in the data).
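A sketch of the filtering step (the object name DepressionFiltered is just a placeholder choice):

```r
# filter() is from the dplyr package; load it if it is not already attached.
library(dplyr)

# Keep only patients whose Simplicity score is at most 2.0.
DepressionFiltered <- Depression %>%
  filter(Simplicity <= 2.0)
```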

  16. Recompute the correlation and regression coefficients on the filtered data. How do the results compare to those obtained on the full data?

  17. In light of your answer to the last exercise, comment on the properties of Pearson’s correlation and “least squares” regression with respect to the concept of robustness (aka resistance).