We have seen how to handle various combinations of categorical and quantitative predictor variables, but so far our response has always been quantitative. What happens if our response is categorical?
The dataset we will use in this handout consists of the votes of 215 members of the U.S. House of Representatives in 2010 on the bill that would become the Affordable Care Act (AKA “Obamacare”). The `Vote` variable is 1 for members who voted “yes” and 0 for members who voted “no”. Several other variables about each representative are available; we will focus on just one: `ObamaMargin`, the margin by which then-president Obama’s popular vote total in the representative’s home district in the 2008 election exceeded or trailed Sen. John McCain’s vote total, as a percentage of the vote in the district. For example, if Obama got 55 percent of the vote and McCain got 44, we’d have `ObamaMargin = 11`; the reverse would yield `ObamaMargin = -11`.
Here is a possible linear regression model to predict `Vote`:
\[ Vote_i = \beta_0 + \beta_1 ObamaMargin_i + \varepsilon_i\]
In a sample, the mean of a binary variable is the proportion of 1s. In the population/“long run”, we can think of the predicted value as the probability of a 1.
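As a quick illustration with made-up data:
# The mean of a 0/1 variable is the proportion of 1s
mean(c(1, 0, 1, 1))   # 0.75, since three of the four values are 1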
The relevant data for the model above is in the file `obamacare.csv`, which is available at http://colindawson.net/data/obamacare.csv. Let’s read in the data from that URL with `read.file()` and fit the above simple linear regression model.
library(mosaic)  # loaded earlier, but a reminder: it provides read.file()
Obamacare <- read.file("http://colindawson.net/data/obamacare.csv")
linearVoteModel <- lm(Vote ~ ObamaMargin, data = Obamacare)
Find the predicted probability of a “yes” vote in the following three sample districts:
## 0.4388256
## -0.03096779
## 1.084792
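One way to compute such predictions is to turn the fitted model into a prediction function with mosaic’s `makeFun()`. A sketch: the specific `ObamaMargin` values below are hypothetical stand-ins for the three districts, not the actual values used above.
predictedVote <- makeFun(linearVoteModel)  # prediction function from the fitted lm
predictedVote(ObamaMargin = 0)    # an evenly split district (hypothetical)
predictedVote(ObamaMargin = -40)  # a district McCain won by 40 points (hypothetical)
predictedVote(ObamaMargin = 60)   # a district Obama won by 60 points (hypothetical)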
As you can see, if we use ordinary linear regression with a binary response, we will sometimes get nonsensical predictions. We would like a model for which the predicted probability of “success” asymptotically heads toward 0 or 1, instead of going negative or exceeding 1. Intuitively, we’d like the prediction function to have a stretched S-shape, flattening out at the extremes.
One \(S\)-shaped function has the form
\[f(x) = \frac{e^x}{1 + e^x}\]
where \(e\) is the base of the natural logarithm, an irrational number approximately equal to 2.718. We could use any other base here; \(e\) is the standard choice because it makes the calculus (which, remember, underlies the process of finding the coefficients in a linear model that maximize fit to the data) a little bit simpler.
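This function is built into base R as `plogis()` (the CDF of the standard logistic distribution, which has exactly this form). Evaluating it at a few points shows the bounded S-shape:
plogis(0)    # 0.5, since exp(0) / (1 + exp(0)) = 1/2
plogis(5)    # about 0.993: approaching 1
plogis(-5)   # about 0.007: approaching 0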
The function above is called the logistic function. One of its first uses, in the mid-19th century, was as a model of population growth in the context of limited resources: at low populations, growth is slow, then picks up roughly exponentially as more organisms produce more offspring, which produce still more organisms, until the population starts to hit the capacity limit of the resources in the environment, at which point growth slows and approaches a saturation point.
The logistic function is convenient for modeling the probability that something happens, because it is naturally bounded in the range that probabilities can be. However, if we want to use it as a model relating a predictor to a probability, we need to generalize the form above somewhat, introducing parameters that allow the relationship between \(x\) and the probability to be flexible.
The following function generalizes the logistic function by introducing parameters \(\beta_0\) and \(\beta_1\), which play roles similar to those of the intercept and slope in linear regression:
\[ f(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}} \]
The following code chunk creates an R object to represent this function.
parameterized_logistic <- function(x, beta0, beta1)
{
  # The exponent is a linear function of x, like a regression equation
  exponent <- beta0 + beta1 * x
  # The logistic transformation squeezes the linear predictor into (0, 1)
  probability <- exp(exponent) / (1 + exp(exponent))
  return(probability)
}
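A quick sanity check of the function (parameter values chosen arbitrarily):
parameterized_logistic(x = 0, beta0 = 0, beta1 = 1)    # 0.5, since the exponent is 0
parameterized_logistic(x = 2, beta0 = 1, beta1 = -1)   # about 0.269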
Try running the code chunk below several times, changing the values of `beta0` and `beta1` each time (try values between -3 and 3 for `beta0` and between -1 and 1 for `beta1`). What effect does changing `beta0` seem to have on the prediction curve? What about `beta1`?
beta0 <- 1
beta1 <- -1
gf_fun(
formula = parameterized_logistic(x, beta0, beta1) ~ x,
xlim = c(-7, 7)) +
scale_x_continuous(name = "X", breaks = -7:7) +
scale_y_continuous(name = "Probability that Y = 1", breaks = seq(0,1,0.1))
The exponent in the function above is a linear function of \(X\): if we were predicting a quantitative response variable using \(X\), the prediction function would be in that form.
Define the function \(g(X)\) to be this linear function:
\[g(X) = \beta_0 + \beta_1 \cdot X\]
The function we graphed above can therefore be written as
\[ f(X) = \text{logistic}(g(X)) = \frac{e^{g(X)}}{1 + e^{g(X)}} \]
The inverse of the logistic function, which transforms \(f(X)\) back into \(g(X)\), is called the logit function. Looking at the logistic and logit functions side by side, we have
\[f(X) = \text{logistic}(g(X)) = \frac{e^{g(X)}}{1 + e^{g(X)}} = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}\]
and, going the other direction:
\[g(X) = \text{logit}(f(X)) = \log \left(\frac{f(X)}{1 - f(X)}\right)\]
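We can verify numerically that these two functions undo each other; base R provides them as `plogis()` (logistic) and `qlogis()` (logit):
qlogis(plogis(1.5))   # recovers 1.5: logit(logistic(1.5))
plogis(qlogis(0.8))   # recovers 0.8: logistic(logit(0.8))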
Since \(g(X)\) was defined as \(\beta_0 + \beta_1 X\), if we think of \(f(X)\) as a model of how the probability that something happens depends on \(X\), then the corresponding model of the logit of that probability (called the log odds that the thing happens) is a linear function of \(X\).
We can also write down a model of the odds that the thing of interest happens, which sits in between the probability and the log odds. Odds are defined in terms of probability as:
\[\text{Odds}(\text{Thing Happens}) = \frac{\text{Probability(Thing Happens)}}{\text{Probability(Thing Doesn't Happen)}}\]
If the probability that a thing happens is modeled as a function of \(X\) using the logistic function
\[ \text{Probability(Thing Happens)} = \text{logistic}(g(X)) = \text{logistic}(\beta_0+\beta_1 X)\]
then, since \(1 - \text{logistic}(g(X)) = \frac{1}{1 + e^{g(X)}}\), the ratio that defines the odds simplifies nicely: the odds that the thing happens are just
\[ \text{Odds(Thing Happens)} = e^{g(X)} = e^{\beta_0 + \beta_1 X} \]
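A quick numerical check of this identity, using an arbitrary value for \(g(X)\):
g <- 0.8                     # an arbitrary value of g(X)
p <- exp(g) / (1 + exp(g))   # probability, via the logistic function
p / (1 - p)                  # odds: about 2.2255
exp(g)                       # matches: exp(0.8) is also about 2.2255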
### Fits a logistic regression model for a binary response
logisticVoteModel <-
glm(Vote ~ ObamaMargin, family = "binomial", data = Obamacare)
plotModel(logisticVoteModel) +
scale_x_continuous(breaks = seq(-60, 100, by = 10)) +
scale_y_continuous(breaks = seq(0,1, by = 0.1))
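We can extract the fitted coefficients with base R’s `coef()`:
coef(logisticVoteModel)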
## (Intercept) ObamaMargin
## -0.3749994 0.1241132
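To see what these coefficients imply, here is a sketch computing the predicted probability of a “yes” vote in a hypothetical district where Obama’s margin was +20 points:
logOdds <- -0.3749994 + 0.1241132 * 20   # g(X) at ObamaMargin = 20
exp(logOdds) / (1 + exp(logOdds))        # predicted probability: about 0.89
Unlike the linear model from earlier, this prediction can never fall below 0 or exceed 1, no matter how extreme the district’s margin is.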