Partnership Roles

  • Whichever person’s first name comes first alphabetically is the designated screen-sharer for this lab (if feasible).
  • Whichever person’s first name comes last alphabetically (i.e., the second name in a group of two) is the designated reader.

Attribution

Parts of this lab use source material by Andrew Bray, Mine Çetinkaya-Rundel, and the UCLA statistics department that accompanies the OpenIntro statistics textbooks. This document, as well as the source material, is covered by a Creative Commons Attribution-ShareAlike 3.0 Unported license.

Learning Objectives

The goals of this lab are

  • to strengthen your intuitions about scatterplots, correlation, and regression models through hands-on practice
  • to learn how to create/calculate the above in R

Scatterplots and Correlation

Background and Data

The Consumer Reports 2015 New Car Buying Guide contains information about a large number of car models that were new at the time. Data for 110 of these cars has been extracted. This exploration will focus on the relationships among several of these variables, including:

| Variable | Description |
|----------|-------------|
| Weight   | Weight of the car (in pounds) |
| CityMPG  | EPA’s estimated miles per gallon for city driving |
| FuelCap  | Size of the gas tank (in gallons) |
| QtrMile  | Time (in seconds) to go 1/4 mile from a standing start |
| Acc060   | Time (in seconds) to accelerate from zero to 60 mph |
| PageNum  | Page number on which the car appears in the buying guide |

Consider the relationship you would expect to see between each of the following pairs of variables for the car data.

  • Weight vs. CityMPG
  • Weight vs. FuelCap
  • PageNum vs. FuelCap
  • Weight vs. QtrMile Time
  • Acc060 vs. QtrMile Time
  • CityMPG vs. QtrMile Time
  1. Place the numbers 1-6 on the chart below to indicate your guess as to the direction (negative, neutral, or positive) and strength of the association between each pair of variables in the list above. Note: If you’re looking at the .Rmd file, the pairs of variables above aren’t numbered; in the Knitted .html, the first pair is numbered 1, the second 2, etc. Note 2: There is no formal definition of what counts as weak, moderate, or strong; just try to give a rough ordering. You may have more than one number at about the same spot.

Response to Exercise 1

(responses will vary, since this exercise was just asking you to make predictions)

Strong - | Moderate - | Weak - | None | Weak + | Moderate + | Strong +

The Data

The dataset is available in the Lock5Data package as Cars2015.

  2. Load the package and the data.

Code for Exercise 2
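A minimal sketch of what this chunk could contain (these same lines reappear, with comments, at the top of the Exercise 3 solution below):

library(Lock5Data)  ## package containing the Cars2015 dataset
library(mosaic)     ## provides gf_point() and the formula interface to cor()
data(Cars2015)      ## load the data into the session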


Making scatterplots

The gf_point() function (provided by mosaic, which needs to be loaded) will produce a scatterplot of two variables from a dataset. In its basic usage, it follows the same syntax template as other plot functions we’ve seen; that is,

## This is not real code, just a template pattern
gf_point(ResponseVariable ~ ExplanatoryVariable, data = DataSetName)
  3. For each of the six pairs of variables you considered, create a scatterplot. Put the first variable of each pair on the x-axis and the second one on the y-axis. Optional: Add xlab= and/or ylab= arguments (with values equal to axis labels in quotation marks) to modify the axis labels. I’ve done the first one for you.

Code for Exercise 3

  • Weight vs. CityMPG
  • Weight vs. FuelCap
  • PageNum vs. FuelCap
  • Weight vs. QtrMile Time
  • Acc060 vs. QtrMile Time
  • CityMPG vs. QtrMile Time
library(Lock5Data)  ## This would have been in Exercise 2
data(Cars2015)      ## This would have been in Exercise 2
gf_point(
  CityMPG ~ Weight, ## The line breaks make the code easier to read
  data = Cars2015,  ## but don't affect the results
  xlab = "Weight (lbs)",
  ylab = "City Miles per Gallon")


  4. Now that you have looked at the data, revise your initial guesses. How did you do?

Response to Exercise 4

Answers will again vary here, and there are no clear delineations between what counts as “Strong” vs. “Moderate” vs. “Weak”, but based on the plots your ordering of the pairs should be similar to other people’s.

Strong - | Moderate - | Weak - | None | Weak + | Moderate + | Strong +

Computing correlation coefficients

The function to compute (Pearson’s) correlations is cor(). It has the same basic argument structure as gf_point():

## Again, just a template; this won't run
cor(Variable1 ~ Variable2, data = DataSetName)
  5. For each of the six pairs of variables you considered, compute Pearson’s correlation.

For Future Reference: In some datasets, some cases may have missing values for some variables. By default, if either of the two variables in question has any missing data, cor() will return “missing” (the special value NA, for “not available” in R’s jargon). We can ask R to instead skip over cases that have missing values by adding the argument use = "pairwise.complete" to the cor() call. That should not be necessary here, however, since this dataset is complete.


Code for Exercise 5
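A sketch of one way to fill this in, applying the cor() template to each pair:

cor(CityMPG ~ Weight, data = Cars2015)
cor(FuelCap ~ Weight, data = Cars2015)
cor(FuelCap ~ PageNum, data = Cars2015)
cor(QtrMile ~ Weight, data = Cars2015)
cor(QtrMile ~ Acc060, data = Cars2015)
cor(QtrMile ~ CityMPG, data = Cars2015)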


Optional RMarkdown Aside: Displaying variables’ values inline within text

If I assign a value to a variable, such as:

thisYear <- 2021

then I can refer to that value outside a code chunk by writing the expression in backticks with an r immediately after the opening backtick (as long as there is a code chunk defining the variable above its first use). For example, the .Rmd source text

“The year is `r thisYear`”

is displayed in the Knitted HTML as

“The year is 2021”

The r is part of the pattern here; without it, the text in the backticks would just be displayed in a “code” font. With it, the expression after the r is replaced with its value.

  6. (Optional) Revise your code for Exercise 5 by assigning the values of the correlations to named objects, instead of printing out their values below the chunk. Then, use the inline reference syntax to put the values of those variables in the table below.

Solution to Exercise 6

## Replace 0 with the command that calculates a correlation, and then
## write a similar assignment line for each pair of variables
variableNameHere <- 0

| Pair                | Correlation |
|---------------------|-------------|
| Weight and CityMPG  | `r variableNameHere` |
| Weight and QtrMile  | (inline reference here) |
| Weight and FuelCap  | (inline reference here) |
| Acc060 and QtrMile  | (inline reference here) |
| PageNum and FuelCap | (inline reference here) |
| CityMPG and QtrMile | (inline reference here) |
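For instance (a sketch; the object name weightCityCor is just an illustration), the first row could be produced by assigning

weightCityCor <- cor(CityMPG ~ Weight, data = Cars2015)

in the chunk and then writing `r weightCityCor` in the table cell. Wrapping the inline expression in round(), e.g. `r round(weightCityCor, 2)`, keeps the displayed value readable.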

Simple Linear Regression

Background

The movie Moneyball focuses on the “quest for the secret of success in baseball”. It follows a low-budget team, the Oakland Athletics, who believed that underused statistics, such as a player’s ability to get on base, better predict the ability to score runs than typical statistics like home runs, RBIs (runs batted in), and batting average. Obtaining players who excelled in these underused statistics turned out to be much more affordable for the team.

The goal of this exploration is to fit and examine various simple (one explanatory variable) linear regression models to predict the number of runs scored by baseball teams in a season, using a variety of common team-level measures of a team’s offensive performance.

The Data

We will use a dataset from the 2011 Major League Baseball season. In addition to runs scored (Runs), there are seven traditionally used variables in the data set: AtBats, Hits, HomeRuns, BattingAvg, Strikeouts, StolenBases, Wins.

There are also three less traditional variables: on base percentage (OBP), slugging percentage (SLG), and on-base plus slugging (OPS).

The data is located at http://colindawson.net/data/mlb11.csv.

First we need to read the data in with read.file() (provided by the mosaic package). I’ll assign it to an object called MLB11:

MLB11 <- read.file("http://colindawson.net/data/mlb11.csv")

A Description of the Variables in the Dataset

| Variable    | Description |
|-------------|-------------|
| Team        | Team name |
| Runs        | Number of runs |
| AtBats      | Number of at bats |
| Hits        | Number of hits |
| HomeRuns    | Number of home runs |
| BattingAvg  | Batting average |
| Strikeouts  | Number of strikeouts |
| StolenBases | Number of stolen bases |
| Wins        | Number of wins |
| OBP         | On-base percentage |
| SLG         | Slugging percentage |
| OPS         | On-base plus slugging |

On-base percentage (OBP) measures the proportion of the time a batter reaches base safely. Most often this is the result of a hit or a walk, but being hit by a pitch (which grants the hitter first base) also counts. Fielding errors, fielder’s choice, dropped/uncaught third strikes, fielder’s obstruction, and catcher’s interference do not count.

Slugging percentage (SLG) is a popular measure of the power of a hitter, calculated as the total bases obtained on hits (one for a single, two for a double, three for a triple, and four for a home run) divided by the number of at bats.

On-base plus slugging (OPS) is simply the sum of these two variables.

Predicting Runs with AtBats

In baseball, teams score by getting runs, so Runs is an “intrinsically valuable” outcome that our regression models will try to predict. Let’s produce a scatterplot to examine the relationship between Runs and AtBats in the MLB11 dataset, treating Runs as the response variable.

gf_point(Runs ~ AtBats, data = MLB11)

  7. Estimate Pearson’s r for these two variables by eyeballing the plot. Then, compute the actual correlation.

Response and Solution to Exercise 7
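Eyeball estimates will vary; a sketch of the computation itself:

cor(Runs ~ AtBats, data = MLB11)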


Finding the line of best fit

The “line of best fit” is defined as the line that makes the sum of squared residuals as small as possible, where, recall, the residual for each point is the gap between its actual “y” value and the “y” value the model predicts it would have based on its “x” value.

The best-fitting line according to this criterion can be obtained analytically with calculus, but we can just use the lm() function in R to give us the intercept and slope of the best-fitting line.

We will want to use the resulting linear model later, so we’ll assign the resulting fitted regression model to a named R object.

AtBatsModel <- lm(Runs ~ AtBats, data = MLB11)

You can display the model coefficients (that is, intercept and slope) by calling the coef() function on the model object. The intercept will be labeled as (Intercept); the slope will be under the name of the “x” variable.

coef(AtBatsModel)
## (Intercept)      AtBats 
## -2789.24289     0.63055
  8. Write down the prediction equation for the best fit line (in the form \(\hat{y} = \hat{a} + \hat{b} x\)), filling in the values of \(\hat{a}\) and \(\hat{b}\) found by lm(), and replacing \(y\) and \(x\) with the names of the actual variables in the data.

Response to Exercise 8
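Filling in the values from the coef() output above (rounded), the equation should look like:

\(\widehat{\text{Runs}} = -2789.24 + 0.63 \times \text{AtBats}\)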


Plotting data with the best fit line

We can add the line of best fit to our scatterplots by “chaining together” two plot commands: gf_point() for the scatterplot, and gf_lm() for the line. The syntax is as follows:

gf_point(Runs ~ AtBats, data = MLB11) %>% 
  gf_lm()

R Aside: pipe syntax

The %>% symbol (written with a percent sign, right angle bracket, percent sign) is called a pipe: it sends the output of one command (in this case, a plot) to another command as an argument. Most of the gf_() plot functions can accept an optional first argument in the form of a plot produced by another gf_() function; if one is specified, the existing plot is added to instead of a new one being created.

Note that gf_lm() has no arguments specified; it inherits these from the plot produced by gf_point(), and doesn’t need any new information.

Finding predicted values

  9. If a team manager saw the least squares regression line and not the actual data, how many runs would they predict for a team with 5,578 at-bats? Estimate this visually first, then find the precise value using the equation for the line.

Response and Solution to Exercise 9
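A sketch of the precise calculation, done two equivalent ways (predict() is base R; the 5,578 at-bats value comes from the exercise):

## Plug into the equation by hand using the stored coefficients
coef(AtBatsModel)[1] + coef(AtBatsModel)[2] * 5578
## Or let R do the plugging in with predict()
predict(AtBatsModel, newdata = data.frame(AtBats = 5578))

Either way, the coefficients shown earlier give a prediction of about 728 runs.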


  10. Fit a new model that uses HomeRuns to predict Runs. Using the estimates from the R output, write the equation of the regression line. What does the slope tell us in the context of the relationship between success of a team and its home runs?

Code and Response to Exercise 10
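A sketch of the model-fitting step (the object name HomeRunsModel is just a choice; the equation and interpretation follow from the printed coefficients):

HomeRunsModel <- lm(Runs ~ HomeRuns, data = MLB11)
coef(HomeRunsModel)  ## intercept and slope for the prediction equation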


Assessing Model Quality

Not every linear model is appropriate, even if the residuals are small. We should check at least two things:

  • Linearity: Is there a “leftover” pattern in the residuals which is associated with the explanatory variable or with the predicted values? If so, the relationship is likely not linear.
  • Approximate “Normality”: Is a histogram of just the residuals approximately bell-shaped (“Normally distributed”)? If not, the best fit line may not be reliable, due to skew or outliers.

To check for linearity, we should plot the residuals against either the explanatory variable or the fitted “y” values (the values right on the line):

## Plotting residuals against explanatory variable
gf_point(residuals(AtBatsModel) ~ AtBats, data = MLB11) %>%
  gf_lm()
## Plotting residuals against fitted values
gf_point(residuals(AtBatsModel) ~ fitted.values(AtBatsModel), data = MLB11) %>%
  gf_lm()
  11. Do you see any patterns here that might raise concerns?

Response to Exercise 11


To check for Normality (bell-shapedness) we can create a density plot of the residuals, which we can compare to a bell curve.

## By using dhistogram instead of histogram, we get density on the y-axis. This is 
## necessary if we want to overlay a density curve as we have here
gf_dhistogram(~residuals(AtBatsModel), data = MLB11, binwidth = 25) %>% 
  gf_density()

  12. Using the HomeRuns model you created above, produce the following plots: (1) A scatterplot with the residuals on the y axis and the explanatory variable on the x axis; (2) A scatterplot with the residuals on the y axis and the fitted values on the x axis; (3) A histogram of the residuals. Does the fit look roughly linear? Are the residuals roughly Normal?

Code and Response to Exercise 12
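A sketch, assuming the model object from Exercise 10 was named HomeRunsModel (the binwidth may need adjusting):

## Residuals vs. explanatory variable
gf_point(residuals(HomeRunsModel) ~ HomeRuns, data = MLB11) %>%
  gf_lm()
## Residuals vs. fitted values
gf_point(residuals(HomeRunsModel) ~ fitted.values(HomeRunsModel), data = MLB11) %>%
  gf_lm()
## Histogram of the residuals, with a density overlay
gf_dhistogram(~residuals(HomeRunsModel), data = MLB11, binwidth = 25) %>%
  gf_density()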


HOMEWORK: Finding the Best Single Predictor of Runs

  13. Explore the other variables in the MLB11 data, and find the one that does the best job of predicting Runs, as measured by the absolute value of the correlation. Fit a regression model using this variable as the explanatory variable, and produce and interpret the standard set of diagnostic plots.

Code and Response to Exercise 13
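One possible starting point (a sketch; it assumes the variable names in the table above and uses base R’s two-vector form of cor()):

## Correlation of each candidate predictor with Runs
candidates <- c("AtBats", "Hits", "HomeRuns", "BattingAvg", "Strikeouts",
                "StolenBases", "Wins", "OBP", "SLG", "OPS")
sapply(candidates, function(v) cor(MLB11$Runs, MLB11[[v]]))

## Then fit and diagnose a model for whichever variable has the largest
## absolute correlation, following the same steps as in Exercise 12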