library(mosaic)
library(Stat2Data)
data(Leukemia)
We are trying to select a model to predict Resp
(will a Leukemia patient respond to treatment? (1
= yes, 0
= no)) from a set of diagnostic and vitals results.
Before we do any modeling, let’s set aside a test set.
set.seed(29747) # Don't forget to do this!
N <- nrow(Leukemia)
testN <- round(0.20 * N)
cat("Total N =",N,"; Test N (20%) =",testN)
## Total N = 51 ; Test N (20%) = 10
HeldOut <- rep(c("no","yes"), times = c(N - testN, testN))
Leukemia <- Leukemia %>%
mutate(HeldOut = sample(HeldOut))
Test <- filter(Leukemia, HeldOut == "yes")
Train <- filter(Leukemia, HeldOut == "no")
head(Test)
## Age Smear Infil Index Blasts Temp Resp Time Status HeldOut
## 1 28 70 70 14 10.0 1010 1 39 1 yes
## 2 31 72 72 5 2.3 988 1 20 1 yes
## 3 33 42 38 12 2.5 984 1 36 0 yes
## 4 37 71 71 15 4.4 1020 0 1 0 yes
## 5 43 74 63 4 0.1 986 0 2 0 yes
## 6 45 78 47 14 4.2 980 1 33 0 yes
We’ll restrict our models to the following predictors:
Age
(age in years)Temp
(highest temperature in 10 * degrees F)Smear
(differential percentage of “blasts”)Infil
(percent of marrow leukemia infiltrate)Index
(percentage labeling index of bone marrow leukemia cells)Blasts
(absolute number of blasts in thousands)Note: it’s not a good idea to jump into modeling in a domain that you don’t have any background in, especially when some of the variables are quite technical in nature.
I chose this example because it has a little bit of “messiness”, but not too much; that said, you wouldn’t want to do this for a project unless you know about Leukemia, or at least about these metrics.
We’ll check individual (quantitative) variables’ distributions first to see whether any obviously call for a transformation
bwplot(~Age, data = Train)
histogram(~Age, data = Train)
The Age
variable is fairly symmetric.
bwplot(~Temp, data = Train)
histogram(~Temp, data = Train)
The Temp
variable is kind of right-skewed
bwplot(~sqrt(Temp), data = Train)
histogram(~sqrt(Temp), data = Train)
Square root didn’t really help.
A log isn’t really sensible here, because temperature isn’t based on a meaningful zero. That is, ratios of degrees Fahrenheit are meaningless.
Square root has a bigger impact near 0, so let’s try shifting and then transforming.
bwplot(~sqrt(Temp - 980), data = Train)
histogram(~sqrt(Temp - 980), data = Train)
That looks nicer.
bwplot(~Smear, data = Train)
histogram(~Smear, data = Train)
This one isn’t too bad.
bwplot(~Infil, data = Train)
histogram(~Infil, data = Train)
This one is a bit left skewed, but not extreme.
bwplot(~Index, data = Train)
histogram(~Index, data = Train)
Looks pretty reasonable.
bwplot(~Blasts, data = Train)
histogram(~Blasts, data = Train)
That’s a lot of skew. It looks structural, though there may be one genuine outlier.
library(tidyverse)
Train %>%
rownames_to_column() %>%
plot(Blasts ~ rowname, data = .)
## Do not run when Knitting
## with(Train, identify(Blasts))
Let’s check whether that person still looks like an outlier after a log transformation.
However, log(0) is undefined. Are there true zeroes in the data?
min(~Blasts, data = Train)
## [1] 0.3
No true zeroes, so we can take a log.
Note: It can sometimes make sense to add a small constant to every observation before taking a log, if you have reason to believe that ratio comparisons are the most natural for that variable.
bwplot(~log(Blasts), data = Train)
histogram(~log(Blasts), data = Train)
That looks a lot better. That point probably isn’t an outlier after all.
We have chosen the following transformations: * Age –> Age (no transformation) * Temp –> sqrt(Temp - 980) * Smear –> Smear (no transformation) * Infil –> Infil (no transformation) * Index –> Index (no transformation) * Blasts –> log(Blasts)
Next, let’s check whether any of the variables are redundant with the others for the purposes of a linear model.
library(car)
## Since vif() expects a model, we'll fit a model with all predictors,
## only for the purposes of checking multicollinearity
fullModel <-
glm(Resp ~ Age + sqrt(Temp - 980) + Smear + Infil + Index + log(Blasts),
family = "binomial", data = Train)
vif(fullModel)
## Age sqrt(Temp - 980) Smear Infil
## 1.162497 1.483266 4.896380 5.790113
## Index log(Blasts)
## 1.573715 2.085865
Smear and Infil are fairly predictable from the others. Infil exceeds our VIF > 5 guideline. We won’t exclude these entirely, because they may not be predictable if we remove some other variables, but it’s something to keep in mind.
The main purpose of this step is to assess whether we might want any polynomial terms.
library(arm)
age.model <- glm(Resp ~ Age, family = "binomial", data = Train)
binnedplot(
fitted(age.model),
residuals(age.model, type = "pearson"),
nclass = 15)
There might be a bit of curvature there.
temp.model <- glm(
Resp ~ sqrt(Temp - 980),
family = "binomial",
data = Train)
binnedplot(
fitted(temp.model),
residuals(temp.model, type = "pearson"),
nclass = 15)
This one looks OKish.
smear.model <- glm(
Resp ~ Smear,
family = "binomial", data = Train)
binnedplot(
fitted(smear.model),
residuals(smear.model, type = "pearson"),
nclass = 15)
That one is a tough call.
index.model <- glm(
Resp ~ Index,
family = "binomial",
data = Train)
binnedplot(
fitted(index.model),
residuals(index.model, type = "pearson"),
nclass = 15)
Looks OK.
infil.model <- glm(
Resp ~ Infil, family = "binomial", data = Train)
binnedplot(
fitted(infil.model),
residuals(infil.model, type = "pearson"),
nclass = 15)
Probably OK?
blasts.model <- glm(
Resp ~ log(Blasts),
family = "binomial",
data = Train)
binnedplot(
fitted(blasts.model),
residuals(blasts.model, type = "pearson"),
nclass = 15)
Looks OK.
So, we may want to consider adding \(Age^2\)
Age and temperature are easy variables to collect, so let’s include those, and then consider which other predictors might be important.
As a quick first check, let’s see if any of them are useful:
baseModel <-
glm(Resp ~ Age + I(Age^2) + sqrt(Temp - 980),
data = Train,
family = "binomial")
fullModel <- glm(
Resp ~ Age + I(Age^2) + sqrt(Temp - 980) +
Smear + Index + Infil + log(Blasts),
data = Train, family = "binomial")
anova(baseModel, fullModel, test = "LRT")
## Analysis of Deviance Table
##
## Model 1: Resp ~ Age + I(Age^2) + sqrt(Temp - 980)
## Model 2: Resp ~ Age + I(Age^2) + sqrt(Temp - 980) + Smear + Index + Infil +
## log(Blasts)
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 37 44.493
## 2 33 23.243 4 21.25 0.0002825 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Pretty strong evidence that at least some of those biomarkers are helpful, after controlling for Age and Temperature.
The AIC is analogous to Mallow’s Cp, and is a “penalized” measure of how well a model fits a dataset.
Since there are four markers in question, there are sixteen (= 2^4) possible combinations of these markers we could include in our model. We could fit all of these and examine their AIC, but to cut down on the number we consider, let’s first see if there is one of these markers that provides all the information we need on its own.
IndexModel <- glm(
Resp ~ poly(Age,2) + sqrt(Temp - 980) + Index,
data = Train, family = "binomial"
)
SmearModel <- glm(
Resp ~ poly(Age,2) + sqrt(Temp - 980) + Smear,
data = Train, family = "binomial"
)
InfilModel <- glm(
Resp ~ poly(Age,2) + sqrt(Temp - 980) + Infil,
data = Train, family = "binomial"
)
BlastsModel <- glm(
Resp ~ poly(Age,2) + sqrt(Temp - 980) + log(Blasts),
data = Train, family = "binomial"
)
AIC(SmearModel)
## [1] 50.91185
AIC(BlastsModel)
## [1] 45.40245
AIC(InfilModel)
## [1] 48.47392
AIC(IndexModel)
## [1] 41.08856
AIC(fullModel)
## [1] 39.24328
The Index
marker comes close, but the full model still does a bit better.
Because Index comes so close to capturing everything in the full model, we likely want it included.
Next, let’s fit the various models with Index and one other marker, and see if any of these give us as good or better AIC as the full model.
IndexModelS <- glm(
Resp ~ poly(Age,2) + sqrt(Temp - 980) + Index + Smear,
data = Train, family = "binomial"
)
IndexModelI <- glm(
Resp ~ poly(Age,2) + sqrt(Temp - 980) + Index + Infil,
data = Train, family = "binomial"
)
IndexModelB <- glm(
Resp ~ poly(Age,2) + sqrt(Temp - 980) + Index + log(Blasts),
data = Train, family = "binomial"
)
AIC(IndexModelS)
## [1] 38.84814
AIC(IndexModelI)
## [1] 35.85037
AIC(IndexModelB)
## [1] 39.9999
AIC(fullModel)
## [1] 39.24328
Looks like Index
and Infil
together give us a potentially better model (at least, when fitting on this dataset) than including all four markers. But so do Index
and Smear
. What about all three?
ISIModel <- glm(
Resp ~ poly(Age,2) + sqrt(Temp - 980) + Index + Smear + Infil,
data = Train, family = "binomial"
)
AIC(IndexModel)
## [1] 41.08856
AIC(IndexModelS)
## [1] 38.84814
AIC(IndexModelI)
## [1] 35.85037
AIC(IndexModelB)
## [1] 39.9999
AIC(ISIModel)
## [1] 37.51595
AIC(fullModel)
## [1] 39.24328
binnedplot(
fitted(IndexModel),
residuals(IndexModel, type = "pearson"))
binnedplot(
fitted(IndexModelS),
residuals(IndexModel, type = "pearson"))
binnedplot(
fitted(IndexModelI),
residuals(IndexModel, type = "pearson"))
binnedplot(
fitted(IndexModelB),
residuals(IndexModel, type = "pearson"))
binnedplot(
fitted(ISIModel),
residuals(IndexModel, type = "pearson"))
binnedplot(
fitted(fullModel),
residuals(IndexModel, type = "pearson"))
None of them show particular evidence of curvature.
One question we might have is whether the relationship between any of the biological markers and the response to treatment is modulated by Age. That is, should we interpret the markers differently depending on the age of the person?
This is equivalent to asking whether we want to include an interaction term between Age and any of the other variables.
AgeInteractionsModel <- glm(
Resp ~ (sqrt(Temp - 980) + Index + Infil + Smear + Blasts) * Age + I(Age^2),
family = "binomial", data = Train)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
anova(fullModel, AgeInteractionsModel, test = "LRT")
## Analysis of Deviance Table
##
## Model 1: Resp ~ Age + I(Age^2) + sqrt(Temp - 980) + Smear + Index + Infil +
## log(Blasts)
## Model 2: Resp ~ (sqrt(Temp - 980) + Index + Infil + Smear + Blasts) *
## Age + I(Age^2)
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 33 23.243
## 2 28 13.785 5 9.4582 0.09213 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Not much evidence for any interactions there.
We’ve settled on a pool of six models. All six include Temp
, Age
, Age^2
and Index
. Then we also include
Infil
Smear
Blasts
Infil
and Smear
Infil
, Smear
and Blasts
Let’s do cross-validation with these six models.
# source("http://colindawson.net/stat213/code/helper_functions.R")
source("~/213_site/code/helper_functions.R")
Some error metrics that make sense for logistic regression:
Normalized deviance: Residual deviance (on validation set) / sample size
We can use normalized.deviance()
from helper_functions.R
“Hard” misclassification rate: Count prediction as an “error” if fitted probability is on the “wrong side” of 0.5 and compute proportion of validation cases that are “misclassified”.
Can use hard.classification.error()
from helper_functions.R
False positive rate:: What proportion of the true 0s are misclassified as 1s?
false.positive.rate()
False negative rate:: What proportion of real 1s are misclassified as 0s
false.negative.rate()
“Soft” misclassification rate:: Count each point as a “partial error” according to the predicted probability of the wrong category.
soft.classification.error()
Concordance, discordance, and Somers’ D:
We’ll make a list of our six finalists, plus the “base” model (for comparison).
finalists <- list(
IndexOnly = IndexModel,
IndexInfil = IndexModelI,
IndexSmear = IndexModelS,
IndexBlasts = IndexModelB,
IndexSmearInfil = ISIModel,
AllFour = fullModel,
None = baseModel)
Now we’ll do 10-fold cross-validation, with 10 random splits for each (to reduce the impact of the choice of folds):
set.seed(8675309)
n <- nrow(Train)
## Normalized deviances (lower is better)
nds <- sapply(
finalists, do.k.fold.cv, data = Train, num.folds = 10,
error.metric = normalized.deviance, iterations = 10)
nds
## IndexOnly IndexInfil IndexSmear IndexBlasts
## 1.340329 1.318492 1.337272 1.328550
## IndexSmearInfil AllFour None
## 1.325797 1.342344 1.432144
All six are relatively close compared to the gap between them and the “base” model.
## hard misclassification rate (lower is better)
hces <- sapply(
finalists, do.k.fold.cv, data = Train, num.folds = 10,
error.metric = hard.classification.error, iterations = 10)
hces
## IndexOnly IndexInfil IndexSmear IndexBlasts
## 0.4540 0.4250 0.4520 0.4515
## IndexSmearInfil AllFour None
## 0.4290 0.4190 0.5930
The leader here is the model with all four predictors.
Let’s examine the false-positive and false-negative rates separately, since we presumably don’t care equally about false-positive and false-negative predictions.
fprs <- sapply(
finalists, do.k.fold.cv, data = Train, num.folds = 10,
error.metric = false.positive.rate, iterations = 10)
fprs
## IndexOnly IndexInfil IndexSmear IndexBlasts
## 0.7901667 0.7573148 0.7491667 0.7802778
## IndexSmearInfil AllFour None
## 0.7652037 0.7116667 1.0000000
Lots of false positives for all models, but fewest for the model that includes all four markers.
What about false negatives?
fnrs <- sapply(
finalists, do.k.fold.cv, data = Train, num.folds = 10,
error.metric = false.negative.rate, iterations = 10)
fnrs
## IndexOnly IndexInfil IndexSmear IndexBlasts
## 0 0 0 0
## IndexSmearInfil AllFour None
## 0 0 0
No false negatives. None of the models have any cases of predicting that someone will fail to respond to treatment who actually does respond.
Seems like the models tend to be “optimistic” about responding to treatment overall.
How about “soft misclassification rate”?
sces <- sapply(
finalists, do.k.fold.cv, data = Train, num.folds = 10,
error.metric = soft.classification.error, iterations = 10)
sces
## IndexOnly IndexInfil IndexSmear IndexBlasts
## 0.4740260 0.4682146 0.4701644 0.4703431
## IndexSmearInfil AllFour None
## 0.4695217 0.4715015 0.4982308
Interestingly, while “hard misclassification rate” pushed us toward the full model, weighting our misclassifications using the confidence in the predicted outcome leads us toward the models with Index
and Infil
, possibly with Smear
.
Next, concordance, discordance, and Somer’s d:
## Higher is better
concordances <-
sapply(
finalists, do.k.fold.cv, data = Train, num.folds = 10,
error.metric = concordance, iterations = 10)
concordances
## IndexOnly IndexInfil IndexSmear IndexBlasts
## 0.8785417 0.8705688 0.8776620 0.8544676
## IndexSmearInfil AllFour None
## 0.8546164 0.7901786 0.7696759
## Lower is better
discordances <-
sapply(
finalists, do.k.fold.cv, data = Train, num.folds = 10,
error.metric = discordance, iterations = 10)
discordances
## IndexOnly IndexInfil IndexSmear IndexBlasts
## 0.1368188 0.1570635 0.1367130 0.1699735
## IndexSmearInfil AllFour None
## 0.1320999 0.2027083 0.2717758
## Higher is better
somers.ds <-
sapply(
finalists, do.k.fold.cv, data = Train, num.folds = 10,
error.metric = somers.d, iterations = 10)
somers.ds
## IndexOnly IndexInfil IndexSmear IndexBlasts
## 0.6787963 0.7703042 0.6848611 0.7555556
## IndexSmearInfil AllFour None
## 0.7749405 0.6706481 0.4272884
As with the normalized deviance and the soft misclassification rate, The two models with Index
and Infil
(with and without Smear
) look the best.
Which of these we would prefer depends on context; e.g., how invasive and expensive is the procedure to collect the Smear
variable, etc., but absent other information, I would tend to opt for the simpler model.
So, the winner is the model with Age
, Age^2
, Temp
, Index
and Infil
.
For this final model, let’s calculate our various performance measures on the held out test set.
## Hard misclassification rate
test.hce <- get.held.out.error(
IndexModelI, test.set = Test,
response.name = "Resp",
error.metric = hard.classification.error)
test.hce
## [1] 0.4
## False positive rate
test.fpr <- get.held.out.error(
IndexModelI, test.set = Test,
response.name = "Resp", error.metric = false.positive.rate)
test.fpr
## [1] 0.3333333
## False negative rate
test.fnrs <- get.held.out.error(
IndexModelI, test.set = Test,
response.name = "Resp", error.metric = false.negative.rate)
test.fnrs
## [1] 0.4285714
## Soft misclassification rate
test.sces <- get.held.out.error(
IndexModelI, test.set = Test,
response.name = "Resp", error.metric = soft.classification.error)
test.sces
## [1] 0.3749122
## Normalized deviance
test.nd <- get.held.out.error(
IndexModelI, test.set = Test,
response.name = "Resp", error.metric = normalized.deviance)
test.nd
## [1] 1.918711
test.somers.d <- get.held.out.error(
IndexModelI, test.set = Test,
response.name = "Resp", error.metric = somers.d)
test.somers.d
## [1] 0.6190476
FinalModel <-
glm(
Resp ~ Age + I(Age^2) + sqrt(Temp - 980) + Index + Infil,
family = "binomial", data = Leukemia)
FinalModel %>% coef() %>% round(3)
## (Intercept) Age I(Age^2) sqrt(Temp - 980)
## 6.727 -0.401 0.003 -0.724
## Index Infil
## 0.401 0.034
\[ \widehat{log odds} = 6.7 - 0.4\cdot Age + 0.003 \cdot Age^2 - 0.72 \cdot \sqrt{Temp - 980} + 0.4 \cdot Index + 0.034 \cdot Infil\]
FinalModel %>% coef() %>% exp() %>% round(5)
## (Intercept) Age I(Age^2) sqrt(Temp - 980)
## 834.61136 0.66957 1.00342 0.48484
## Index Infil
## 1.49323 1.03477
It seems that higher Index scores are associated with a higher probability of responding to treatment, after controlling for age, temperature, and percentage infiltrate, with the odds of response increasing by about 49% with each 1 point increase in Index relative to what is expected based on age and temperature.
Similarly, higher Infil
scores are associated with a higher probability of responding to treatment, after controlling for age, temperature, and Index
, with the odds of response increasing by about 3.4% for each 1 point difference in Infil
.
It seems that higher body temperatures (on a square root scale) are associated with a lower chance of recovery (controlling for age and biomarkers).
For each one unit increase in the square root of temperature in 10ths of a degree above 98F (relative to what is expected given Age, Index and Infil), the odds of responding to treatment go down by 52%.
For patients around 40 years of age, and adjusting for expected differences in biomarkers, an additional 5 years of age is associated with
odds.factor.45vs40 <- 0.66957^(45-40)*1.00342^(45^2-40^2)
odds.factor.45vs40
## [1] 0.5743148
about a 42% decrease in the odds of responding to treatment.
Around 50 years of age, an additional 5 years of age is associated with
odds.factor.55vs50 <- 0.66957^(55-50)*1.00342^(55^2-50^2)
odds.factor.55vs50
## [1] 0.8080259
odds.factor.65vs60 <- 0.66957^(65-60)*1.00342^(65^2-60^2)
odds.factor.65vs60
## [1] 1.136843
However, from 60 to 65, the odds of responding to treatment go back up, after controlling for temperature and biomarkers. This may be because 65-year-olds are expected to have worse biomarker indicators and/or temperature readings than 60-year-olds, and so if a 65-year-old presented with the same indicators as a 60-year-old, it might be an indication that it is a person with a stronger system.
Let’s look at the predicted recovery probabilities for a few example patients.
f.hat <- makeFun.logistic(FinalModel)
To see what sensible values to plug in for Index
and Infil
are, let’s look at a scatterplot and five number summaries for these markers.
fivenum(~Index, data = Leukemia)
## [1] 1.0 6.0 9.0 13.5 20.0
fivenum(~Infil, data = Leukemia)
## [1] 8.0 37.5 61.0 72.5 95.0
gf_point(Index ~ Infil, data = Leukemia)
Patient 1a: 20 years old with relatively high index and infil scores, and a normal body temperature (favorable scenario)
f.hat(Age = 20, Temp = 986, Index = 13.5, Infil = 72.5, interval = "confidence")
## Age Temp Index Infil pi.hat lwr upr
## 1 20 986 13.5 72.5 0.9979536 0.9066558 0.9999592
We estimate that a young patient with high index and infil scores and a normal body temp has at least a 90% chance of responding to treatment.
Patient 1b: same as Patient 1a but with a fever of 101F
f.hat(Age = 20, Temp = 1010, Index = 13.5, Infil = 72.5, interval = "confidence")
## Age Temp Index Infil pi.hat lwr upr
## 1 20 1010 13.5 72.5 0.9819743 0.6716373 0.9993112
If a patient like this has a fever, the probability of responding might well still be near 100%, but it might be down to something like 67%.
Patient 2a: 50 years old, moderate index and infil scores (near the median), normal body temp
f.hat(Age = 50, Temp = 986, Index = 9, Infil = 61, interval = "confidence")
## Age Temp Index Infil pi.hat lwr upr
## 1 50 986 9 61 0.2970462 0.1061308 0.6006274
Patients like this might have anywhere between a 10% and 60% chance of responding.
Patient 2b: Same as 2a but with a fever of 101F
f.hat(Age = 50, Temp = 1010, Index = 9, Infil = 61, interval = "confidence")
## Age Temp Index Infil pi.hat lwr upr
## 1 50 1010 9 61 0.04507679 0.004957011 0.3090543
The presence of a fever of 101F worsens the outlook for this patient considerably. The probability of responding goes down to somewhere between 0.5% and 31%.
Patient 3: 75 years old, fever of 101F, low index and infil (unfavorable scenario)
f.hat(Age = 75, Temp = 1010, Index = 6, Infil = 37.5, interval = "confidence")
## Age Temp Index Infil pi.hat lwr upr
## 1 75 1010 6 37.5 0.01209107 0.0004113198 0.2668784
Even if our model is pessimistic, we doubt that the probability of responding to treatment is higher than 27% or so for a patient like this, and more likely it’s less than 1%.