In addition to the usual tidyverse and mosaic packages, we will need the arm and plotly packages for some of the residual plots we'll be using. We will also be using the ICU dataset from the Stat2Data package.
library(tidyverse)
library(mosaic)
library(arm)       # for binnedplot()
library(plotly)    # for ggplotly()
library(Stat2Data) # for the ICU dataset
Having loaded the packages we need, let’s load the dataset next.
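The chunk that loads the data isn't shown here; for datasets in the Stat2Data package, the standard call is data():

data(ICU)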
Type

?ICU

at the console to see the documentation for the dataset, so you know what the cases are and how each variable is defined.

We want to estimate patients' chance of surviving the ICU, using Systolic Blood Pressure. The response variable will be Survive throughout this lab; for this first model, the only predictor is SysBP.
Let’s fit the model first.
Now we can plot the prediction curve on the probability scale using the convenient plotModel() function from mosaic.
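Assuming the lin_BP_model object fit above:

plotModel(lin_BP_model)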
Interpret the coefficient of SysBP in terms of the relationship between SysBP and the odds of Survival.

The SysBP variable is not binned, so to check residuals we'll want to bin the residuals according to the predicted probabilities. Since there are 200 cases in the dataset, 25 bins should give us a decent level of granularity, with about 8 datapoints in each bin. We could use either the deviance or Pearson residuals here; I'm using Pearson.
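Here's a sketch of that binned residual plot, using binnedplot() from arm; fitted() extracts the predicted probabilities, and nclass sets the number of bins:

binnedplot(fitted(lin_BP_model),
           residuals(lin_BP_model, type = "pearson"),
           nclass = 25)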
This doesn’t look too bad… but we already had an expectation that the “best” blood pressures would be in the middle, and it does seem like theres a slight hint of downward curvature to the residuals, so let’s try fitting a quadratic.
quad_BP_model <-
glm(Survive ~ SysBP + I(SysBP^2), family = "binomial", data = ICU)
plotModel(quad_BP_model)
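And a sketch of the binned residuals for the quadratic model, binned the same way as before:

binnedplot(fitted(quad_BP_model),
           residuals(quad_BP_model, type = "pearson"),
           nclass = 25)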
The points are more tightly clustered on the right now, reflecting the fact that our new prediction curve predicts high survival chances in the blood pressure range where most of the data is. The other difference is that we no longer have consistently negative residuals at the two ends. So this seems like it’s potentially something of an improvement.
Since the two models differ by only one coefficient, we can compare them using one of two hypothesis tests: either a \(z\)-test of the quadratic coefficient, or a nested Likelihood Ratio Test. With linear regression models, these would yield exactly the same P-values; with logistic regression models this isn’t the case, as they are based on somewhat different asymptotic approximations of distributions, but they should still be close.
Here’s the nested Likelihood Ratio Test.
## Analysis of Deviance Table
##
## Model 1: Survive ~ SysBP
## Model 2: Survive ~ SysBP + I(SysBP^2)
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 198 191.34
## 2 197 183.68 1 7.6587 0.00565 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Here’s the summary()
output of the quadratic model, which includes the \(z\)-test of the quadratic coefficient:
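The table below looks rounded to three decimal places; something along these lines (an assumption about the original code) would reproduce it:

round(coef(summary(quad_BP_model)), 3)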
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.909 1.744 -2.814 0.005
## SysBP 0.085 0.026 3.313 0.001
## I(SysBP^2) 0.000 0.000 -2.854 0.004
Before we take the hypothesis test too seriously, though, it's also worth looking at the distributions of the individual residuals to check for potential outliers, since these can exert a large influence on the shape of the fitted curve, and won't necessarily show up when we bin the residuals.
Let’s look at histograms of the residuals for the first-degree model.
For code readability, let’s collect both the deviance and Pearson residuals for these two models as named objects.
dresids1 <- residuals(lin_BP_model, type = "deviance")
presids1 <- residuals(lin_BP_model, type = "pearson")
dresids2 <- residuals(quad_BP_model, type = "deviance")
presids2 <- residuals(quad_BP_model, type = "pearson")
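Here's a sketch of the promised histograms; gf_histogram() comes from the ggformula package that mosaic loads, and the bin count is a guess:

gf_histogram(~ dresids1, bins = 25) # deviance residuals, first-order model
gf_histogram(~ presids1, bins = 25) # Pearson residuals, first-order model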
QQ Plots aren’t really of use here since we don’t expect Normally distributed residuals, but it can be useful to look at boxplots.
With 200 datapoints, we expect to have a few residuals around -2, and maybe even one or two around -3, but a Pearson residual of -6 for the first model is pretty surprising.
The quadratic model doesn’t have any major standout residuals, though to be fair, that could be because the outlier is “hiding itself” in the quadratic model by pulling the curve to it. That’s more likely to be the case if it also has high leverage.
Let’s try to figure out what point it is that has that -6. The ggplotly()
function from the plotly
package can be used with an existing plot to add some hovertext, allowing us to see more information about individual datapoints.
In order to see which datapoint it is, let's make a column that contains the row names, and include that in our plot. (As it happens, this dataset already has an ID column that we could use, but it's useful to be able to do this in case you have a dataset that doesn't have an ID column.)
ICU %>%
  mutate(rownumber = parse_number(rownames(.))) %>% # row names as a numeric column
  gf_point(presids1 ~ rownumber, data = .) %>%
  ggplotly() # Enables hovertext
It looks like the case in row 196 is the one with the big negative residual in the first-order model. Let’s find out more about this person.
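One way to pull out that row (slice() is my choice here; the original extraction code isn't shown):

ICU %>% slice(196)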
## ID Survive Age AgeGroup Sex Infection SysBP Pulse Emergency
## 1 921 0 50 2 1 0 256 64 1
This person is the patient with ID 921: a 50-year-old female admitted on an emergency basis with a systolic blood pressure of 256. That's pretty high! Let's see how it compares to the other blood pressures in the data by getting some summary statistics and a boxplot of the SysBP variable.
Use favstats() and gf_boxplot() to examine the distribution of blood pressures. Where does this one sit in the distribution?
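A sketch of one way to do it, using mosaic's formula interface:

favstats(~ SysBP, data = ICU)
gf_boxplot(~ SysBP, data = ICU)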
Not surprisingly, this is the highest BP in the whole dataset. That gives it a ton of leverage, on top of the influence it has from being an outlier (the patient in question didn't survive, despite a high predicted survival chance according to the first-order model).
Let’s check out Cook’s distance (Note: this doesn’t measure “outlierness” quite right with logistic regression, but it’s ok as a rough tool)
The absolute magnitude isn't extraordinarily high, but it definitely sticks out, with more than three times as much influence as the next most influential point. Let's create a dataset that excludes this potential outlier and refit our models, in case it changes the picture dramatically:
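A sketch of the exclusion and the refits; the object names here are my own, not from the original:

ICU_sub <- ICU %>% filter(ID != 921)
lin_BP_model_sub <- glm(Survive ~ SysBP, family = "binomial", data = ICU_sub)
quad_BP_model_sub <- glm(Survive ~ SysBP + I(SysBP^2), family = "binomial", data = ICU_sub)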