Parts of this lab use source material by Andrew Bray, Mine Çetinkaya-Rundel, and the UCLA statistics department, which accompanies the OpenIntro statistics textbooks. This document, as well as the source material, is covered by a Creative Commons Attribution-ShareAlike 3.0 Unported license.
The goals of this lab are to practice exploring relationships between quantitative variables using scatterplots and correlations, and to fit, interpret, and assess simple linear regression models.
The Consumer Reports 2015 New Car Buying Guide contains lots of information for a large number of new (at that time) car models. Some of the data for 110 of these cars has been extracted. This exploration will focus on the relationships among several of these variables, including:
Variable | Description |
---|---|
Weight | Weight of the car (in pounds) |
CityMPG | EPA’s estimated miles per gallon for city driving |
FuelCap | Size of the gas tank (in gallons) |
QtrMile | Time (in seconds) to go 1/4 mile from a standing start |
Acc060 | Time (in seconds) to accelerate from zero to 60 mph |
PageNum | Page number on which the car appears in the buying guide |
Consider the relationship you would expect to see between each of the following pairs of variables for the car data.

1. Weight vs. CityMPG
2. Weight vs. FuelCap
3. PageNum(ber) vs. FuelCap(acity)
4. Weight vs. QtrMile Time
5. Acc060 vs. QtrMile Time
6. CityMPG vs. QtrMile Time

Note 1: In the .Rmd file the pairs of variables above aren’t numbered — in the Knitted .html, the first pair is numbered 1, the second is 2, etc.

Note 2: There is no formal definition of what counts as weak, moderate, or strong; just try to give a rough ordering. You may have more than one letter at about the same spot.

(Responses will vary, since this exercise was just asking you to make predictions.)
Strong - | Moderate - | Weak - | None | Weak + | Moderate + | Strong + |
---|---|---|---|---|---|---|
The dataset is available in the Lock5Data package as Cars2015.
The gf_point() function (provided by mosaic, which needs to be loaded) will produce a scatterplot of two variables from a dataset. In its basic usage, it follows the same syntax template as other plot functions we’ve seen; that is,
## This is not real code, just a template pattern
gf_point(ResponseVariable ~ ExplanatoryVariable, data = DataSetName)
Make a scatterplot for each of the following pairs of variables. You can add the xlab= and/or ylab= arguments (with values equal to axis labels in quotation marks) to modify the axis labels. I’ve done the first one for you.

1. Weight vs. CityMPG
2. Weight vs. FuelCap
3. PageNum(ber) vs. FuelCap(acity)
4. Weight vs. QtrMile Time
5. Acc060 vs. QtrMile Time
6. CityMPG vs. QtrMile Time

library(Lock5Data)   ## This would have been in Exercise 2
data(Cars2015)       ## This would have been in Exercise 2
gf_point(
  CityMPG ~ Weight,  ## The line breaks make the code easier to read
  data = Cars2015,   ## but don't affect the results
  xlab = "Weight (lbs)",
  ylab = "City Miles per Gallon")
There will again be a variety of answers here, and there are no clear delineations between what counts as “Strong” vs. “Moderate” vs. “Weak”, but based on the plots your ordering of the pairs should be similar to other people’s.
Strong - | Moderate - | Weak - | None | Weak + | Moderate + | Strong + |
---|---|---|---|---|---|---|
The function to compute (Pearson’s) correlations is cor(). It has the same basic argument structure as gf_point():
## Again, just a template; this won't run
cor(Variable1 ~ Variable2, data = DataSetName)
For future reference: in some datasets, some cases may have missing values for some variables. By default, if either of the two variables in question has any missing data, the correlation function will return “missing” (the special value NA, short for “not available” in R’s jargon). We can ask R to instead skip over cases that have missing values by adding the argument use = "pairwise.complete" in the cor() function. This should not be necessary here, however, since the data is complete.
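For instance, a correlation that skips over incomplete cases could be computed as follows (just a sketch; Cars2015 has no missing values, so here the result is the same with or without the extra argument):

cor(CityMPG ~ Weight, data = Cars2015, use = "pairwise.complete")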
If I assign a value to a variable, such as:
thisYear <- 2021
then I can refer to that value outside a code chunk as follows (as long as there is a code chunk defining the variable which is above its first use):
“The year is 2021”
(compare the .Rmd source for the line above to what is displayed in the Knitted HTML)

The r is part of the pattern here; without it, the text in the backticks would be displayed in a “code” font. With it, the expression after the r is replaced with its value.
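For example, the .Rmd source of the sentence shown above follows this pattern, where the expression is wrapped in a pair of backticks with the letter r just inside the opening backtick (thisYear is the variable assigned in the chunk above):

The year is `r thisYear`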
## Replace 0 with the command that calculates a correlation and then
## write a similar assignment line for each pair of variables
variableNameHere <- 0
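For instance (a sketch only — the object name weightCityCor is just an illustration, not a required name), the first assignment might become:

## Store the correlation between Weight and CityMPG so it can be
## referred to with inline code later
weightCityCor <- cor(CityMPG ~ Weight, data = Cars2015)

Each stored value can then be displayed in the table below using an inline code chunk, as described above.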
Pair | Correlation |
---|---|
Weight and CityMPG | 0 |
Weight and QtrMile | r |
Weight and FuelCap | r |
Acc060 and QtrMile | r |
PageNum and FuelCap | r |
CityMPG and QtrMile | r |
The movie Moneyball focuses on the “quest for the secret of success in baseball”. It follows a low-budget team, the Oakland Athletics, who believed that underused statistics, such as a player’s ability to get on base, better predict the ability to score runs than typical statistics like home runs, RBIs (runs batted in), and batting average. Obtaining players who excelled in these underused statistics turned out to be much more affordable for the team.
The goal of this exploration is to fit and examine various simple (one explanatory variable) linear regression models to predict the number of runs scored by baseball teams in a season, using a variety of common team-level measures of a team’s offensive performance.
We will use a dataset from the 2011 Major League Baseball season. In addition to runs scored (Runs), there are seven traditionally used variables in the data set: AtBats, Hits, HomeRuns, BattingAvg, Strikeouts, StolenBases, and Wins.

There are also three less traditional variables: on-base percentage (OBP), slugging percentage (SLG), and on-base plus slugging (OPS).

The data is located at http://colindawson.net/data/mlb11.csv.
First we need to read the data in with read.file(). I’ll assign it to an object called MLB11:

MLB11 <- read.file("http://colindawson.net/data/mlb11.csv")
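As a quick optional check (not required by the lab) that the data loaded as expected, we might peek at the first few rows and count them — there should be one row per MLB team, so 30 rows:

head(MLB11)   ## first few teams
nrow(MLB11)   ## should be 30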
Variable | Description |
---|---|
Team | Team name |
Runs | Number of runs |
AtBats | Number of at bats |
Hits | Number of hits |
HomeRuns | Number of home runs |
BattingAvg | Batting average |
Strikeouts | Number of strikeouts |
StolenBases | Number of stolen bases |
Wins | Number of wins |
OBP | On-base percentage |
SLG | Slugging percentage |
OPS | On base plus slugging |
On-base percentage (OBP) is a measure of what proportion of the time a batter reaches base safely. Most often this is the result of a hit or a walk, but being hit by a pitch (which grants the hitter first base) also counts. Fielding errors, fielder’s choice, dropped/uncaught third strikes, fielder’s obstruction, or catcher’s interference do not count.
Slugging average (SLG) is a popular measure of the power of a hitter, calculated as the total bases obtained on hits (one for a single, two for a double, three for a triple, and four for a home run) divided by the number of at bats.

On-base plus slugging (OPS) is just the sum of these two variables.
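As a small worked illustration of these formulas (the numbers here are made up): a batter with 2 singles, 1 double, and 1 home run in 10 at bats has SLG = (2×1 + 1×2 + 1×4)/10 = 0.8; if that batter’s OBP were 0.400, their OPS would be 0.400 + 0.800 = 1.200.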
Runs with AtBats

In baseball, teams score by achieving Runs, so this is an “intrinsically valuable” outcome that our regression models will try to predict. Let’s produce a scatterplot to examine the relationship between Runs and AtBats in the MLB11 dataset, treating Runs as the response variable.
gf_point(Runs ~ AtBats, data = MLB11)
The “line of best fit” is defined as the line that makes the sum of squared residuals as small as possible, where, recall, the residual for each point is the gap between its actual “y” value and the “y” value the model predicts it would have based on its “x” value.
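In symbols, writing \(b_0\) and \(b_1\) for the intercept and slope of a candidate line, the predicted value for case \(i\) is \(\hat{y}_i = b_0 + b_1 x_i\), its residual is \(y_i - \hat{y}_i\), and the line of best fit is the choice of \(b_0\) and \(b_1\) that minimizes the sum of squared residuals \(\sum_{i=1}^{n} (y_i - \hat{y}_i)^2\).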
The best-fitting line according to this criterion can be obtained analytically with calculus, but we can just use the lm() function in R to give us the intercept and slope of the best-fitting line.
We will want to use the resulting linear model later, so we’ll assign the fitted regression model to a named R object.
AtBatsModel <- lm(Runs ~ AtBats, data = MLB11)
You can display the model coefficients (that is, intercept and slope) by calling the coef() function on the model object. The intercept will be labeled as (Intercept); the slope will be under the name of the “x” variable.
coef(AtBatsModel)
## (Intercept) AtBats
## -2789.24289 0.63055
Write the equation of the regression line, using the intercept and slope reported by lm(), and replacing \(y\) and \(x\) with the names of the actual variables in the data.

We can add the line of best fit to our scatterplots by “chaining together” two plot commands: gf_point() for the scatterplot, and gf_lm() for the line. The syntax is as follows:
gf_point(Runs ~ AtBats, data = MLB11) %>%
gf_lm()
The %>% symbol (written with a percent sign, right angle bracket, percent sign) is called a pipe: it sends the output of one command (in this case, a plot) to another command as an argument (most of the gf_() plot functions can accept an optional first argument in the form of a plot produced by another gf_() function. If one is specified, then the existing plot is added to, instead of creating a new one).
Note that gf_lm() has no arguments specified; it inherits these from the plot produced by gf_point(), and doesn’t need any new information.
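The same chaining idea extends to other gf_() functions. As one illustrative sketch (the axis label text here is just an example, not something the lab asks for), gf_labs() can be chained on to relabel the axes of the plot above:

gf_point(Runs ~ AtBats, data = MLB11) %>%
  gf_lm() %>%
  gf_labs(x = "At bats", y = "Runs scored")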
Fit a new linear model that uses HomeRuns to predict Runs. Using the estimates from the R output, write the equation of the regression line. What does the slope tell us in the context of the relationship between success of a team and its home runs?

Not every linear model is appropriate, even if the residuals are small. We should check at least two things: whether the relationship actually looks linear, and whether the residuals are roughly Normally distributed.
To check for linearity, we should plot the residuals against either the explanatory variable or the fitted “y” values (the values right on the line):
## Plotting residuals against explanatory variable
gf_point(residuals(AtBatsModel) ~ AtBats, data = MLB11) %>%
gf_lm()
## Plotting residuals against fitted values
gf_point(residuals(AtBatsModel) ~ fitted.values(AtBatsModel), data = MLB11) %>%
gf_lm()
To check for Normality (bell-shapedness) we can create a density plot of the residuals, which we can compare to a bell curve.
## By using dhistogram instead of histogram, we get density on the y-axis. This is
## necessary if we want to overlay a density curve as we have here
gf_dhistogram(~residuals(AtBatsModel), data = MLB11, binwidth = 25) %>%
gf_density()
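Another common way to assess Normality (optional here, and just a sketch assuming the gf_qq() function from the ggformula plotting tools that mosaic loads) is a normal quantile-quantile plot of the residuals; points falling close to a straight line suggest roughly Normal residuals:

gf_qq(~ residuals(AtBatsModel))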
For the HomeRuns model you created above, produce the following plots: (1) a scatterplot with the residuals on the y axis and the explanatory variable on the x axis; (2) a scatterplot with the residuals on the y axis and the fitted values on the x axis; (3) a histogram of the residuals. Does the fit look roughly linear? Are the residuals roughly Normal?

Examine the other variables in the MLB11 data, and find the one that does the best job of predicting Runs, as measured by the absolute value of the correlation. Fit a regression model using this variable as the explanatory variable, and produce and interpret the standard set of diagnostic plots.