We have seen how to handle various combinations of categorical and quantitative predictor variables, but so far our response has always been quantitative. What happens if our response is categorical?

The dataset we will use in this handout consists of votes of 215 members of the U.S. House of Representatives in 2010 on the bill that would become the Affordable Care Act (AKA “Obamacare”).

  • The Vote variable is 1 for members who voted “yes” and 0 for members who voted “no”.

  • Several other variables about each representative are available; we will focus on just one: ObamaMargin, which is the margin by which then-president Obama’s popular vote total in the representative’s home district in the 2008 election exceeded or trailed Sen. John McCain’s vote total, as a percentage of the vote in the district.

  • For example, if Obama got 55 percent of the vote and McCain got 44 percent, we’d have ObamaMargin = 11; the reverse would yield ObamaMargin = -11.

Here is a possible linear regression model to predict Vote:

\[ Vote_i = \beta_0 + \beta_1 ObamaMargin_i + \varepsilon_i\]

  1. Using ordinary linear regression, the predicted response value represents the mean response at particular values of the predictors. What does the mean of a binary response (coded as 0 or 1) represent, more intuitively? (Hint: What does the sum of a bunch of 0s and 1s tell us?)

RESPONSE

In a sample, the mean of a binary variable is the proportion of 1s. In the population/“long run”, we can think of the predicted value as the probability of a 1.
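
A quick numeric illustration (made-up values, not from the dataset):

```r
# The mean of a 0/1 vector equals the proportion of 1s
toy_votes <- c(1, 0, 1, 1, 0)   # three "yes" votes out of five
mean(toy_votes)                  # 0.6
```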

The relevant data for the model above is in the file obamacare.csv, which is available at http://colindawson.net/data/obamacare.csv. Let’s read in the data from the URL with read.file() and fit the simple linear regression model above.
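
A sketch of what this might look like, assuming the mosaic package (which provides read.file()) is loaded; the object names obamacare and vote_lm are placeholders, not fixed by the handout:

```r
library(mosaic)

# Read the data directly from the URL; read.file() infers the format from the .csv extension
obamacare <- read.file("http://colindawson.net/data/obamacare.csv")

# Fit the simple linear regression of Vote on ObamaMargin
vote_lm <- lm(Vote ~ ObamaMargin, data = obamacare)
coef(vote_lm)
```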

  1. Find the predicted probability of a “yes” vote in the following three sample districts:

    • a district where Obama and McCain tied
    • a “deep red” district where Obama’s total was 40 percentage points lower than McCain’s
    • a “deep blue” district where Obama’s total was 55 percentage points higher than McCain’s

SOLUTION
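
One way to compute these, assuming the vote_lm fit sketched above and mosaic’s makeFun() helper (which turns a fitted model into a prediction function):

```r
vote_fn <- makeFun(vote_lm)

vote_fn(ObamaMargin = 0)     # a tied district
vote_fn(ObamaMargin = -40)   # a "deep red" district
vote_fn(ObamaMargin = 55)    # a "deep blue" district
```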

##         1 
## 0.4388256
##           1 
## -0.03096779
##        1 
## 1.084792

As you can see, if we use ordinary linear regression with a binary response, we will sometimes get nonsensical predictions. We would like a model for which the predicted probability of “success” asymptotically heads toward 0 or 1, instead of going negative or exceeding 1. Intuitively, we’d like the prediction function to have a stretched S-shape, flattening out at the extremes.

One \(S\)-shaped function has the form

\[f(x) = \frac{e^x}{1 + e^x}\]

where \(e \approx 2.718\) is the base of the natural logarithm. We could use any other base here; \(e\) is the standard choice because it makes the calculus (which, remember, is what underlies finding the coefficients in a linear model that maximize fit to the data) a little bit simpler.
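
To see the boundedness numerically, here is a quick check (the function name logistic is just a label for this sketch):

```r
logistic <- function(x) exp(x) / (1 + exp(x))

logistic(c(-10, -2, 0, 2, 10))
# approximately 0.00005, 0.12, 0.50, 0.88, 0.99995 -- always strictly between 0 and 1
```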

The function above is called the logistic function. One of its first uses, in the mid-19th century, was as a model of population growth in the context of limited resources: at low populations, growth is slow; it then picks up roughly exponentially, since having more organisms makes population growth faster, which makes more organisms, and so on; eventually the population starts to hit the capacity limit of the resources in the environment, at which point growth slows and approaches a saturation point.

  1. If \(x\) can be any number, positive or negative, what are the smallest and largest values the logistic function can attain? (Remember that raising a number to a negative exponent is equivalent to dividing 1 by that number raised to the corresponding positive exponent.)

SOLUTION

As \(x\) becomes very large and negative, \(e^x\) approaches 0, so \(f(x)\) approaches \(\frac{0}{1 + 0} = 0\); as \(x\) becomes very large and positive, \(e^x\) dwarfs the 1 in the denominator, so \(f(x)\) approaches 1. The function never actually reaches 0 or 1, so its values are always strictly between 0 and 1.

The logistic function is convenient for modeling the probability that something happens, because it is naturally bounded within the range that probabilities can take. However, if we want to use it as a model relating a predictor to a probability, we need to generalize the form above somewhat, introducing parameters that allow the relationship between \(x\) and the probability to be flexible.

The following function generalizes the logistic function by introducing parameters \(\beta_0\) and \(\beta_1\), which play roles similar to those of the intercept and slope in linear regression:

\[ f(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}} \]

The following code chunk creates an R object to represent this function.
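
A minimal sketch of such a definition (the name logistic_curve is a placeholder, not fixed by the handout):

```r
# The generalized logistic function, with beta0 and beta1 as explicit arguments
logistic_curve <- function(x, beta0, beta1) {
  exp(beta0 + beta1 * x) / (1 + exp(beta0 + beta1 * x))
}
```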

  1. The following chunk defines some values for \(\beta_0\) and \(\beta_1\), and plots the function as a function of \(x\) only, using those parameter values. Run this chunk a few times, choosing different values of beta0 and beta1 each time. (Try values between -3 and 3 for beta0 and between -1 and 1 for beta1.) What effect does changing beta0 seem to have on the prediction curve? What about beta1?
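
Here is one way that chunk might look, assuming the logistic_curve() sketch above:

```r
beta0 <- 1     # try values between -3 and 3
beta1 <- 0.5   # try values between -1 and 1

# curve() evaluates the expression at a grid of x values and draws the result
curve(logistic_curve(x, beta0, beta1), from = -10, to = 10,
      xlab = "x", ylab = "Predicted probability")
```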


SOLUTION

Changing beta0 shifts the curve horizontally: it moves the point where the predicted probability crosses 0.5 (which occurs where \(\beta_0 + \beta_1 x = 0\), i.e., at \(x = -\beta_0/\beta_1\)). Changing beta1 controls the direction and steepness of the curve: a positive beta1 gives an increasing curve, a negative beta1 a decreasing one, and larger magnitudes make the transition between 0 and 1 sharper.

The exponent in the function above is a linear function of \(X\): if we were predicting a quantitative response variable using \(X\), the prediction function would be in that form.

Define the function \(g(X)\) to be this linear function:

\[g(X) = \beta_0 + \beta_1 \cdot X\]

The function we graphed above can therefore be written as

\[ f(X) = \text{logistic}(g(X)) = \frac{e^{g(X)}}{1 + e^{g(X)}} \]

The inverse of the logistic function, which transforms \(f(X)\) back into \(g(X)\), is called the logit function. Looking at the logistic and logit functions side by side, we have

\[f(X) = \text{logistic}(g(X)) = \frac{e^{g(X)}}{1 + e^{g(X)}} = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}\]

and, going the other direction:

\[g(X) = \text{logit}(f(X)) = \log \left(\frac{f(X)}{1 - f(X)}\right)\]
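
To see why the logit undoes the logistic, write \(p = f(X)\) and solve for \(g(X)\):

\[ p = \frac{e^{g(X)}}{1 + e^{g(X)}} \;\Longrightarrow\; p\left(1 + e^{g(X)}\right) = e^{g(X)} \;\Longrightarrow\; e^{g(X)} = \frac{p}{1 - p} \;\Longrightarrow\; g(X) = \log\left(\frac{p}{1 - p}\right) \]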

Since \(g(X)\) was defined as \(\beta_0 + \beta_1 X\), if we think of \(f(X)\) as a model of how the probability that something happens depends on \(X\), then the corresponding model of the logit of the probability (which is called the log odds that the thing we are modeling happens) is a linear function of \(X\).

We can also write down a model of the odds that the thing of interest happens, which is a step in between the probability and the log odds. The odds are defined in terms of the probability as:

\[\text{Odds}(\text{Thing Happens}) = \frac{\text{Probability(Thing Happens)}}{\text{Probability(Thing Doesn't Happen)}}\]

If the probability that a thing happens is modeled as a function of \(X\) using the logistic function

\[ \text{Probability(Thing Happens)} = \text{logistic}(g(X)) = \text{logistic}(\beta_0+\beta_1 X)\]

then the odds that the thing happens are simply

\[ \text{Odds(Thing Happens)} = e^{g(X)} = e^{\beta_0 + \beta_1 X} \]
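
This follows because the probability that the thing doesn’t happen is \(1 - \frac{e^{g(X)}}{1 + e^{g(X)}} = \frac{1}{1 + e^{g(X)}}\), so the ratio of the two probabilities is

\[ \text{Odds(Thing Happens)} = \frac{e^{g(X)}/(1 + e^{g(X)})}{1/(1 + e^{g(X)})} = e^{g(X)} \]

(For instance, a probability of 0.75 corresponds to odds of \(0.75/0.25 = 3\), or “3 to 1”.)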

  1. We will get more into the details and interpretation of the coefficients in logistic models later, but for now, do the following to fit and plot a logistic curve that predicts the probability of a “yes” vote on the Affordable Care Act from the 2008 election margin in favor of Obama in the congressperson’s district. Roughly what election margin in a district produces a predicted probability of about 50% for a “yes” vote? (Just eyeball it from the plot.)
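
One way to do this, sketched with base R’s glm() (the binomial family uses the logistic link by default; the object name vote_logit and the plotting approach are illustrative, not prescribed by the handout):

```r
# Fit the logistic regression model
vote_logit <- glm(Vote ~ ObamaMargin, data = obamacare, family = binomial)
coef(vote_logit)

# Plot the 0/1 votes against the margin and overlay the fitted logistic curve
plot(Vote ~ ObamaMargin, data = obamacare)
curve(predict(vote_logit, newdata = data.frame(ObamaMargin = x), type = "response"),
      add = TRUE)

# The predicted probability is 50% where the linear part equals zero,
# i.e., at ObamaMargin = -beta0 / beta1
```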

## (Intercept) ObamaMargin 
##  -0.3749994   0.1241132