STAT 213: Indicator Variables Lab

Partnership Roles

Whichever person has the ObieID which is first alphabetically will be the “reader” for this lab. The reader’s job is to share your screen, and read the document aloud. Your other job is to type what your partner tells you to type.
The other person (or the person whose ObieID comes next alphabetically, if there are three of you) is the “director”. You are the one who, for this assignment, tells the “reader” what to type. You can also ask the reader to pause so you can re-read something, or ask clarification questions, etc. You should type the same thing in your copy of the lab document, so you have a record, but the reader is the one sharing their screen, and you shouldn’t move ahead until everyone is ready.
If you are in a group of three, the third person is the “co-director”: you and the “director” should trade off in this role between sections, and when you’re not in charge, your job is to jump in to ask questions and/or offer suggestions.

The learning goals of this lab are

To understand how binary predictors work in multiple regression models
To practice fitting and plotting multiple regression models in R

The Data

We will be working mostly with the Pulse data that I used for some examples in class. The data was collected from 232 statistics students, who provided their height, weight, sex, exercise and smoking habits via a survey, and measured their pulse rate before and after three laps walking up and down a set of stairs.

Let’s load the data before we do anything else.

library(Stat2Data)
data(Pulse)

Models With Indicator Variables

Some of the variables, like Smoke, are coded as 0 or 1. These are called indicator variables, because they serve to “indicate” whether a case satisfies some condition, recording a 1 for cases that do satsify the condition and 0 for those that don’t. You may have also heard these called dummy variables.

Conceptually, an indicator is a categorical variable, so let’s create a modified dataset in which this is made explicit. This won’t affect the models we get but it will help R recognize which variables are categorical for plotting purposes.

Instead of Sex, let’s also define a variable Male so it’s clearer what condition is being indicated by a 1, and make this categorical as well.

PulseModified <- Pulse %>%
  mutate(
    Male = factor(1 - Sex),
    Smoke = factor(Smoke)
  )

A single binary predictor

The Smoke variable is equal to 1 for students who said they smoke cigarettes, and 0 for those who said they don’t.

Let’s first fit a model that just uses this variable to predict active pulse rate.

smoke_model <- lm(Active ~ Smoke, data = PulseModified)

Get the coefficients and write out the regression equation for this model. The output will have a coefficient for Smoke1 instead of Smoke: because we made Smoke categorical, this coefficient is added to the prediction when Smoke is 1, and not otherwise. What does the coefficient for Smoke1 tell us in context?

SOLUTION

Perform a t-test of the null hypothesis that the population coefficient for Smoke1 is 0. What does this tell us in context?

SOLUTION

What percentage of the variation in the active pulse rates in this data is predictable using smoking status?

SOLUTION

One quantitative and one binary predictor

Let’s add Rest to the model as a predictor.

rest_smoke_model <- lm(Active ~ Rest + Smoke, data = PulseModified)

Write out the prediction equation for this model.

SOLUTION

Using this model (that is, without fitting any additional models), write out the prediction equation for nonsmokers (those who have Smoke = 0). In other words, assume Smoke is zero and simplify the prediction equation so it has the form of a model with Rest as the only variable input.

SOLUTION

Now write out the prediction equation for smokers (assume Smoke = 1). Again, simplify the equation so it has the form of a model with Rest as the only variable input.

SOLUTION

Here’s some code to plot the data, with Rest on the x-axis, Active on the y-axis, with the points colored according to Smoke, and the two separate regression lines from the two subgroups plotted. Everything starting with scale_color_discrete() is just cleaning up the legend on the plot.

## CAUTION: don't try to use this code with a model that has multiple quantitative
## predictors; it likely won't make sense
plotModel(rest_smoke_model) +
  scale_color_discrete(
    name   = "Smoking Status",
    labels = c(
      "0" = "Nonsmoker", 
      "1" = "Smoker"))

Considering the plot and the two “subgroup” models you wrote out. How do they differ? What does this tell us about how to interpret the coefficient for Smoke1 in the full model?

SOLUTION

Examine the summary() of the two predictor model. What null hypothesis is being tested by the t-test of the coefficient for Smoke1? How should we interpret the results of the test? What do you make of the fact that the P-value here is so different from the one we had in the model that didn’t have Rest as a predictor?

SOLUTION

A model with an interaction term

The next model will be of the form

\[ \widehat{Active} = \beta_0 + \beta_{Smoke} Smoke + \beta_{Rest} Rest + \beta_{Smoke:Rest} Smoke \cdot Rest\]

The Smoke variable is still an indicator, equal to 1 for smokers and 0 for non-smokers, and Rest still measures resting pulse. The last term in the model involves multiplying these two variables together. This might seem like an odd thing to do, but we will soon see why this is useful.

If Smoke is 0, then what is Smoke * Rest? If Smoke is 1, what is Smoke * Rest?

SOLUTION

Setting Smoke to 0, simplify the prediction equation above into the form of a line relating Rest to Active. What are the intercept and slope in terms of the betas?

SOLUTION

Now do the same setting Smoke to 1. Group constant terms and terms that involve Rest, and factor out Rest to write the equation in intercept slope form. This time the intercept and slope will be expressions involving more than one beta.

SOLUTION

Compare the intercepts and slopes of these two subgroup models. How can we interpret each of the betas?

SOLUTION

Let’s fit the model and plot the subgroup lines on the data. Note: The model is the same regardless of what order we write the terms, but the plotting function only works properly if the quantitative variable is supplied first.

smoke_rest_interaction_model <- lm(Active ~ Rest + Smoke + Smoke:Rest, data = PulseModified)

plotModel(smoke_rest_interaction_model) +
  scale_color_discrete(
    name   = "Smoking Status",
    labels = c(
      "0" = "Nonsmoker", 
      "1" = "Smoker"))

Looking at the plot and before looking at the summary of the model, what sign should the coefficient for Smoke * Rest have? Will it be large or small in magnitude? What sign should the coefficient for Smoke have? (Careful with this last one: remember that the intercept term tells us the predicted value when the predictor is equal to 0, so you’ll need to imagine what the lines would do if we took Rest all the way down to 0) Check your intuitions by getting the summary() output.

SOLUTION

Bonus Material: Centering a Predictor

Sometimes it can be useful to center variables before fitting a model; that is, transforming them by subtracting their mean, so that 0 after the transformation represents the mean of that variable, and the value tells us how far above (or below, for negative values) the mean a case is for that variable. Here’s the last model we fit but using centered pulse rates.

PulseModified <- PulseModified %>%
  mutate(
    RestCentered = Rest - mean(Rest),
    ActiveCentered = Active - mean(Active)
  )
rest_smoke_centered_model <- 
  lm(ActiveCentered ~ RestCentered + Smoke + Smoke:RestCentered, data = PulseModified)
summary(rest_smoke_centered_model)

## 
## Call:
## lm(formula = ActiveCentered ~ RestCentered + Smoke + Smoke:RestCentered, 
##     data = PulseModified)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -35.050  -9.971  -2.088   7.260  63.778 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -0.14563    1.05303  -0.138    0.890    
## RestCentered         1.13352    0.10698  10.595   <2e-16 ***
## Smoke1               1.18051    3.42335   0.345    0.731    
## RestCentered:Smoke1  0.02692    0.32603   0.083    0.934    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.09 on 228 degrees of freedom
## Multiple R-squared:  0.3655, Adjusted R-squared:  0.3572 
## F-statistic: 43.78 on 3 and 228 DF,  p-value: < 2.2e-16

Which coefficient(s) changed and which one(s) stayed the same when we centered the quantitative variables? For those that changed, how does their interpretation change now that we’re using centered values? (Hint: what is the role of the intercept, now that the meaning of zero has changed?)

STAT 213: Indicator Variables Lab

Colin Reimer Dawson

June 25, 2021

Partnership Roles

The Data

Models With Indicator Variables

A single binary predictor

SOLUTION

SOLUTION

SOLUTION

One quantitative and one binary predictor

SOLUTION

SOLUTION

SOLUTION

SOLUTION

SOLUTION

A model with an interaction term

SOLUTION

SOLUTION

SOLUTION

SOLUTION

SOLUTION

Bonus Material: Centering a Predictor

SOLUTION