Choosing and Assessing Logistic Regression Models

Learning Goals

  • Develop a feel for possible choices of model when we have a binary response
  • Practice fitting and assessing such models

Preliminaries and Data

In addition to the usual tidyverse and mosaic packages, we will need the arm and plotly packages for some of the residual plots we’ll be using.

We will also be using the ICU dataset from the Stat2Data package.

Having loaded the packages we need, let’s load the dataset next.
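A setup chunk along these lines would do the job (assuming the packages are already installed):

```r
# Load the packages used in this lab
library(tidyverse)
library(mosaic)
library(arm)     # for binnedplot()
library(plotly)  # for ggplotly()

# Load the ICU data from the Stat2Data package
library(Stat2Data)
data(ICU)
```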

  1. Run ?ICU at the console to see the documentation for the dataset, so you know what the cases are and how each variable is defined.

We want to estimate patients’ chance of surviving in the ICU, using Systolic Blood Pressure. The response variable will be Survive throughout this lab. For this first model the only predictor is SysBP.

Fitting the Model

Let’s fit the model first.
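A minimal fitting chunk might look like the following; the model name `surv_mod1` is just a placeholder.

```r
# Logistic regression of survival on systolic blood pressure
surv_mod1 <- glm(Survive ~ SysBP, data = ICU, family = binomial)
```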

Now we can plot the prediction curve on the probability scale using the convenient plotModel() function from mosaic.
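Assuming the fitted model is stored as `surv_mod1`, the call is simply:

```r
# Plot fitted survival probabilities against SysBP
plotModel(surv_mod1)
```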

  1. Examine the coefficients for this model. Write the prediction equation for the odds of Survival, and interpret the coefficient for SysBP in terms of the relationship between SysBP and the odds of Survival.

SOLUTION

Checking conditions

Logit linearity

The SysBP variable is not binned, so to check residuals we’ll want to bin the residuals according to the predicted probabilities. Since there are 200 cases in the dataset, 25 bins should give us a decent level of granularity, with about 8 datapoints in each bin. We could use either the deviance or Pearson residuals here; I’m using Pearson.
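A binned residual plot along these lines would do it, assuming the first-order fit is stored as `surv_mod1`; `binnedplot()` comes from the arm package.

```r
# Bin Pearson residuals by fitted probability
# (25 bins of 200 cases: about 8 cases per bin)
binnedplot(x = fitted(surv_mod1),
           y = resid(surv_mod1, type = "pearson"),
           nclass = 25)
```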

This doesn’t look too bad… but we already had an expectation that the “best” blood pressures would be in the middle, and there does seem to be a slight hint of downward curvature in the residuals, so let’s try fitting a quadratic.
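The quadratic fit and its binned residual plot might be produced like so (the model name `surv_mod2` is a placeholder):

```r
# Add a squared term; I() protects ^ from formula expansion
surv_mod2 <- glm(Survive ~ SysBP + I(SysBP^2), data = ICU,
                 family = binomial)

# Binned Pearson residuals for the quadratic model
binnedplot(x = fitted(surv_mod2),
           y = resid(surv_mod2, type = "pearson"),
           nclass = 25)
```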

The points are more tightly clustered on the right now, reflecting the fact that our new prediction curve predicts high survival chances in the blood pressure range where most of the data is. The other difference is that we no longer have consistently negative residuals at the two ends. So this seems like it’s potentially something of an improvement.

Hypothesis Tests

Since the two models differ by only one coefficient, we can compare them using one of two hypothesis tests: either a \(z\)-test of the quadratic coefficient, or a nested Likelihood Ratio Test. With linear regression models, these would yield exactly the same P-values; with logistic regression models this isn’t the case, as they are based on somewhat different asymptotic approximations of distributions, but they should still be close.

Here’s the nested Likelihood Ratio Test.
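Assuming the two fits are stored as `surv_mod1` and `surv_mod2`, `anova()` carries out the nested test:

```r
# Likelihood ratio (drop-in-deviance) test of the quadratic term
anova(surv_mod1, surv_mod2, test = "Chisq")
```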

## Analysis of Deviance Table
## 
## Model 1: Survive ~ SysBP
## Model 2: Survive ~ SysBP + I(SysBP^2)
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)   
## 1       198     191.34                        
## 2       197     183.68  1   7.6587  0.00565 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Here’s the summary() output of the quadratic model, which includes the \(z\)-test of the quadratic coefficient:
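If the quadratic fit is stored as `surv_mod2`, a rounded coefficient table like the one below can be extracted with:

```r
# Coefficient table from summary(), rounded to three decimals
round(coef(summary(surv_mod2)), 3)
```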

##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)   -4.909      1.744  -2.814    0.005
## SysBP          0.085      0.026   3.313    0.001
## I(SysBP^2)     0.000      0.000  -2.854    0.004

  1. What conclusion would you draw from these tests about whether the complexity introduced by the quadratic appears to be justified?

SOLUTION

Checking for outliers

Before we take the hypothesis test too seriously, though, it’s also worth looking at the distributions of the individual residuals to check for potential outliers, since these can exert a large influence on the shape of the fitted curve, and won’t necessarily show up when we bin the residuals.

Let’s look at histograms of the residuals for the first-degree model.

For code readability, let’s collect both the deviance and Pearson residuals for these two models as named objects.
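One way to do this, with placeholder object names and assuming the two fits are `surv_mod1` and `surv_mod2`:

```r
# Deviance and Pearson residuals for each model
dev_resids1     <- resid(surv_mod1, type = "deviance")
pearson_resids1 <- resid(surv_mod1, type = "pearson")
dev_resids2     <- resid(surv_mod2, type = "deviance")
pearson_resids2 <- resid(surv_mod2, type = "pearson")
```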

  1. Create histograms of these two sets of residuals. Does it look like there are potential outliers?

SOLUTION

QQ Plots aren’t really of use here since we don’t expect Normally distributed residuals, but it can be useful to look at boxplots.

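Boxplots of the first-order model’s residuals could be drawn like this, using the residual objects defined above (`tibble()` replaces the deprecated `data_frame()`):

```r
# Gather both residual types into one data frame for plotting
resid_df <- tibble(
  type  = rep(c("deviance", "pearson"), each = 200),
  resid = c(dev_resids1, pearson_resids1)
)
gf_boxplot(resid ~ type, data = resid_df)
```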

With 200 datapoints, we expect to have a few residuals around -2, and maybe even one or two around -3, but a Pearson residual of -6 for the first model is pretty surprising.

The quadratic model doesn’t have any major standout residuals, though to be fair, that could be because the outlier is “hiding itself” in the quadratic model by pulling the curve to it. That’s more likely to be the case if it also has high leverage.

Let’s try to figure out what point it is that has that -6. The ggplotly() function from the plotly package can be used with an existing plot to add some hovertext, allowing us to see more information about individual datapoints.

To see which datapoint it is, let’s make a column containing the row names and include it in our plot. (As it happens, this dataset already has an ID column we could use, but it’s useful to know how to do this in case you have a dataset that doesn’t.)
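A sketch of that plot, with a hypothetical `row` column; `ggplotly()` folds the `label` aesthetic into the hover tooltip. The first-order fit is assumed to be stored as `surv_mod1`.

```r
# Attach row names so each point can be identified on hover
ICU_id <- ICU %>% mutate(row = row.names(ICU))

p <- ggplot(ICU_id,
            aes(x = SysBP,
                y = resid(surv_mod1, type = "pearson"),
                label = row)) +
  geom_point()

ggplotly(p)
```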

It looks like the case in row 196 is the one with the big negative residual in the first-order model. Let’s find out more about this person.
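One way to pull out that row (`slice()` resets the row names, which is why the printout below shows row 1):

```r
# Extract row 196 from the dataset
ICU %>% slice(196)
```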

##    ID Survive Age AgeGroup Sex Infection SysBP Pulse Emergency
## 1 921       0  50        2   1         0   256    64         1

This person is the patient with ID 921: a 50-year-old female admitted on an emergency basis with a systolic blood pressure of 256. That’s pretty high! Let’s see how it compares to the other blood pressures in the data by getting some summary statistics and a boxplot of the SysBP variable.

  1. Use favstats() and gf_boxplot() to examine the distribution of blood pressures. Where does this one sit in the distribution?

SOLUTION

Not surprisingly, this is the highest BP in the whole dataset. That gives it a ton of leverage, on top of the influence it has from being an outlier (the patient in question didn’t survive despite high predicted survival chance according to the first-order model).

Let’s check out Cook’s distance. (Note: Cook’s distance is only an approximate measure of influence for logistic regression, but it’s OK as a rough tool.)
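A quick plot of the Cook’s distances, again assuming the first-order model is stored as `surv_mod1`:

```r
# Cook's distance for each case, plotted against case index
cooks_d <- cooks.distance(surv_mod1)
gf_point(cooks_d ~ seq_along(cooks_d))
```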

The absolute magnitude isn’t extraordinarily high, but it definitely sticks out, with more than three times as much influence as the next most influential point. Let’s create a dataset that excludes this potential outlier and refit our models, in case it changes the picture dramatically:
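Dropping the row by position, with a placeholder name for the reduced dataset:

```r
# Remove the high-leverage patient in row 196
ICU_no196 <- ICU %>% slice(-196)
```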

  1. Fit and plot the first- and second-order models using this dataset.

SOLUTION

  1. How are our residuals now? Look at binned plots and histograms of residuals for the models fit without this patient.

SOLUTION

  1. Perform relevant hypothesis tests using these models. Do you reach a different conclusion with and without the potential outlier included?

SOLUTION