The learning goals of this lab are
We will be working mostly with the Pulse
data that I used for some examples in class. The data was collected from 232 statistics students, who provided their height, weight, sex, exercise and smoking habits via a survey, and measured their pulse rate before and after three laps walking up and down a set of stairs.
Let’s load the data before we do anything else.
Some of the variables, like Smoke
, are coded as 0 or 1. These are called indicator variables, because they serve to “indicate” whether a case satisfies some condition, recording a 1 for cases that do satsify the condition and 0 for those that don’t. You may have also heard these called dummy variables.
Conceptually, an indicator is a categorical variable, so let’s create a modified dataset in which this is made explicit. This won’t affect the models we get but it will help R recognize which variables are categorical for plotting purposes.
Instead of Sex
, let’s also define a variable Male
so it’s clearer what condition is being indicated by a 1, and make this categorical as well.
The Smoke
variable is equal to 1 for students who said they smoke cigarettes, and 0 for those who said they don’t.
Let’s first fit a model that just uses this variable to predict active pulse rate.
Smoke1
instead of Smoke
: because we made Smoke
categorical, this coefficient is added to the prediction when Smoke
is 1, and not otherwise. What does the coefficient for Smoke1
tell us in context?Smoke1
is 0. What does this tell us in context?Let’s add Rest
to the model as a predictor.
Smoke = 0
). In other words, assume Smoke
is zero and simplify the prediction equation so it has the form of a model with Rest
as the only variable input.Smoke = 1
). Again, simplify the equation so it has the form of a model with Rest
as the only variable input.Here’s some code to plot the data, with Rest
on the x-axis, Active
on the y-axis, with the points colored according to Smoke
, and the two separate regression lines from the two subgroups plotted. Everything starting with scale_color_discrete()
is just cleaning up the legend on the plot.
## CAUTION: don't try to use this code with a model that has multiple quantitative
## predictors; it likely won't make sense
plotModel(rest_smoke_model) +
scale_color_discrete(
name = "Smoking Status",
labels = c(
"0" = "Nonsmoker",
"1" = "Smoker"))
Smoke1
in the full model?summary()
of the two predictor model. What null hypothesis is being tested by the t-test of the coefficient for Smoke1
? How should we interpret the results of the test? What do you make of the fact that the P-value here is so different from the one we had in the model that didn’t have Rest
as a predictor?The next model will be of the form
\[ \widehat{Active} = \beta_0 + \beta_{Smoke} Smoke + \beta_{Rest} Rest + \beta_{Smoke:Rest} Smoke \cdot Rest\]
The Smoke
variable is still an indicator, equal to 1 for smokers and 0 for non-smokers, and Rest
still measures resting pulse. The last term in the model involves multiplying these two variables together. This might seem like an odd thing to do, but we will soon see why this is useful.
Smoke
is 0, then what is Smoke * Rest
? If Smoke
is 1, what is Smoke * Rest
?Smoke
to 0, simplify the prediction equation above into the form of a line relating Rest
to Active
. What are the intercept and slope in terms of the betas?Smoke
to 1. Group constant terms and terms that involve Rest
, and factor out Rest
to write the equation in intercept slope form. This time the intercept and slope will be expressions involving more than one beta.Let’s fit the model and plot the subgroup lines on the data. Note: The model is the same regardless of what order we write the terms, but the plotting function only works properly if the quantitative variable is supplied first.
plotModel(smoke_rest_interaction_model) +
scale_color_discrete(
name = "Smoking Status",
labels = c(
"0" = "Nonsmoker",
"1" = "Smoker"))
Smoke * Rest
have? Will it be large or small in magnitude? What sign should the coefficient for Smoke
have? (Careful with this last one: remember that the intercept term tells us the predicted value when the predictor is equal to 0, so you’ll need to imagine what the lines would do if we took Rest
all the way down to 0) Check your intuitions by getting the summary()
output.Sometimes it can be useful to center variables before fitting a model; that is, transforming them by subtracting their mean, so that 0 after the transformation represents the mean of that variable, and the value tells us how far above (or below, for negative values) the mean a case is for that variable. Here’s the last model we fit but using centered pulse rates.
PulseModified <- PulseModified %>%
mutate(
RestCentered = Rest - mean(Rest),
ActiveCentered = Active - mean(Active)
)
rest_smoke_centered_model <-
lm(ActiveCentered ~ RestCentered + Smoke + Smoke:RestCentered, data = PulseModified)
summary(rest_smoke_centered_model)
##
## Call:
## lm(formula = ActiveCentered ~ RestCentered + Smoke + Smoke:RestCentered,
## data = PulseModified)
##
## Residuals:
## Min 1Q Median 3Q Max
## -35.050 -9.971 -2.088 7.260 63.778
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.14563 1.05303 -0.138 0.890
## RestCentered 1.13352 0.10698 10.595 <2e-16 ***
## Smoke1 1.18051 3.42335 0.345 0.731
## RestCentered:Smoke1 0.02692 0.32603 0.083 0.934
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.09 on 228 degrees of freedom
## Multiple R-squared: 0.3655, Adjusted R-squared: 0.3572
## F-statistic: 43.78 on 3 and 228 DF, p-value: < 2.2e-16