library(mosaic)
library(Stat2Data)
library(arm) # for binnedplot()
data(ICU)
## Don't run in Markdown
?ICU
We want to estimate patients’ chance of surviving in the ICU, using Systolic Blood Pressure.
lin.BP.model <- glm(Survive ~ SysBP, data = ICU, family = "binomial")
plotModel(lin.BP.model)
Check the binned residuals
binnedplot(fitted(lin.BP.model), residuals(lin.BP.model, type = "pearson"),
nclass = 25)
Looks like there might be some nonlinearity here. Maybe we should try a quadratic?
quad.BP.model <-
glm(Survive ~ SysBP + I(SysBP^2), family = "binomial", data = ICU)
plotModel(quad.BP.model)
binnedplot(fitted(quad.BP.model), residuals(quad.BP.model, type = "pearson"), nclass = 15)
Not a whole lot better, to be honest.
summary(quad.BP.model)
##
## Call:
## glm(formula = Survive ~ SysBP + I(SysBP^2), family = "binomial",
## data = ICU)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0676 0.5014 0.5434 0.6291 1.8196
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.909e+00 1.744e+00 -2.814 0.004886 **
## SysBP 8.460e-02 2.553e-02 3.313 0.000922 ***
## I(SysBP^2) -2.583e-04 9.053e-05 -2.854 0.004323 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 200.16 on 199 degrees of freedom
## Residual deviance: 183.68 on 197 degrees of freedom
## AIC: 189.68
##
## Number of Fisher Scoring iterations: 4
Based on the z test, the quadratic term seems to be helpful, though…
Let’s look at the residuals for the first-order model:
## Deviance Residuals
dresids <- residuals(lin.BP.model, type = "deviance")
histogram(dresids, breaks = 30)
## Pearson Residuals
presids <- residuals(lin.BP.model, type = "pearson")
histogram(presids, breaks = 30)
Oof. That point way over on the left looks bad.
plot(presids)
Negative 6 Pearson residual. Double oof.
Let’s try to figure out what point that is…
plot(presids)
## Don't run when Knitting
identify(presids)
## integer(0)
Let’s see what that is.
ICU[196,]
## ID Survive Age AgeGroup Sex Infection SysBP Pulse Emergency
## 196 921 0 50 2 1 0 256 64 1
Their Systolic BP is pretty high. Where is it in the distribution?
favstats(~SysBP, data = ICU)
## min Q1 median Q3 max mean sd n missing
## 36 110 130 150 256 132.28 32.9521 200 0
gf_boxplot(~SysBP, data = ICU)
Not terribly surprising; it’s the highest BP in the whole dataset. That’s giving it a ton of leverage, on top of the influence it has from being an outlier for Survival relative to BP.
Let’s check out Cook’s distance (Note: doesn’t measure outlier status quite right with logistic regression, but it’s ok as a rough tool)
plot(lin.BP.model, which = 4)
Let’s try leaving it out of the fit, in case it changes the picture dramatically:
lin.BP.model.no.outlier <-
glm(Survive ~ SysBP, data = filter(ICU, ID != 921), family = "binomial")
plotModel(lin.BP.model.no.outlier)
quad.BP.model.no.outlier <-
glm(Survive ~ SysBP + I(SysBP^2), family = "binomial", data = filter(ICU, ID != 921))
plotModel(quad.BP.model.no.outlier )
How are our residuals now?
binnedplot(
fitted(quad.BP.model.no.outlier),
residuals(quad.BP.model.no.outlier, type = "pearson"),
nclass = 15)
presids <- residuals(quad.BP.model.no.outlier, type = "pearson")
histogram(presids, breaks = 30)
binnedplot(
fitted(lin.BP.model.no.outlier),
residuals(lin.BP.model, type = "pearson"),
nclass = 25)
presids <- residuals(lin.BP.model.no.outlier, type = "pearson")
histogram(presids, breaks = 30)