Diagnosing Outliers in Logistic Regression

library(mosaic)
library(Stat2Data)
library(arm) # for binnedplot()

data(ICU)
## Don't run in Markdown
?ICU

We want to estimate patients’ chance of surviving in the ICU, using Systolic Blood Pressure.

lin.BP.model <- glm(Survive ~ SysBP, data = ICU, family = "binomial")
plotModel(lin.BP.model)

Check the binned residuals

binnedplot(fitted(lin.BP.model), residuals(lin.BP.model, type = "pearson"), 
           nclass = 25)

Looks like there might be some nonlinearity here. Maybe we should try a quadratic?

quad.BP.model <- 
  glm(Survive ~ SysBP + I(SysBP^2), family = "binomial", data = ICU)
plotModel(quad.BP.model)

binnedplot(fitted(quad.BP.model), residuals(quad.BP.model, type = "pearson"), nclass = 15)

Not a whole lot better, to be honest.

summary(quad.BP.model)

## 
## Call:
## glm(formula = Survive ~ SysBP + I(SysBP^2), family = "binomial", 
##     data = ICU)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0676   0.5014   0.5434   0.6291   1.8196  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -4.909e+00  1.744e+00  -2.814 0.004886 ** 
## SysBP        8.460e-02  2.553e-02   3.313 0.000922 ***
## I(SysBP^2)  -2.583e-04  9.053e-05  -2.854 0.004323 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 200.16  on 199  degrees of freedom
## Residual deviance: 183.68  on 197  degrees of freedom
## AIC: 189.68
## 
## Number of Fisher Scoring iterations: 4

Based on the z test, the quadratic term seems to be helpful, though…

Let’s look at the residuals for the first-order model:

## Deviance Residuals
dresids <- residuals(lin.BP.model, type = "deviance")
histogram(dresids, breaks = 30)

## Pearson Residuals
presids <- residuals(lin.BP.model, type = "pearson")
histogram(presids, breaks = 30)

Oof. That point way over on the left looks bad.

plot(presids)

Negative 6 Pearson residual. Double oof.

Let’s try to figure out what point that is…

plot(presids)
## Don't run when Knitting
identify(presids)

## integer(0)

Let’s see what that is.

ICU[196,]

##      ID Survive Age AgeGroup Sex Infection SysBP Pulse Emergency
## 196 921       0  50        2   1         0   256    64         1

Their Systolic BP is pretty high. Where is it in the distribution?

favstats(~SysBP, data = ICU)

##  min  Q1 median  Q3 max   mean      sd   n missing
##   36 110    130 150 256 132.28 32.9521 200       0

gf_boxplot(~SysBP, data = ICU)

Not terribly surprising; it’s the highest BP in the whole dataset. That’s giving it a ton of leverage, on top of the influence it has from being an outlier for Survival relative to BP.

Let’s check out Cook’s distance (Note: doesn’t measure outlier status quite right with logistic regression, but it’s ok as a rough tool)

plot(lin.BP.model, which = 4)

Let’s try leaving it out of the fit, in case it changes the picture dramatically:

lin.BP.model.no.outlier <- 
  glm(Survive ~ SysBP, data = filter(ICU, ID != 921), family = "binomial")
plotModel(lin.BP.model.no.outlier)

quad.BP.model.no.outlier <- 
  glm(Survive ~ SysBP + I(SysBP^2), family = "binomial", data = filter(ICU, ID != 921))
plotModel(quad.BP.model.no.outlier  )

How are our residuals now?

binnedplot(
  fitted(quad.BP.model.no.outlier), 
  residuals(quad.BP.model.no.outlier, type = "pearson"), 
  nclass = 15)

presids <- residuals(quad.BP.model.no.outlier, type = "pearson")
histogram(presids, breaks = 30)

binnedplot(
  fitted(lin.BP.model.no.outlier), 
  residuals(lin.BP.model, type = "pearson"), 
  nclass = 25)

presids <- residuals(lin.BP.model.no.outlier, type = "pearson")
histogram(presids, breaks = 30)

Diagnosing Outliers in Logistic Regression

Colin Dawson

4/17/2020