Loading packages and data

library(mosaic)
library(Stat2Data)
data(Leukemia)
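
If the variables are unfamiliar, the codebook is in the package documentation:

## Do not run when knitting:
## str(Leukemia)
## help("Leukemia", package = "Stat2Data")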

Splitting into Train and Test

We are trying to select a model to predict Resp (whether a leukemia patient responds to treatment; 1 = yes, 0 = no) from a set of diagnostic and vital-sign measurements.

Before we do any modeling, let’s set aside a test set.

set.seed(29747) # Don't forget to do this!
N <- nrow(Leukemia)
testN <- round(0.20 * N)
cat("Total N =",N,"; Test N (20%) =",testN)
## Total N = 51 ; Test N (20%) = 10
HeldOut <- rep(c("no","yes"), times = c(N - testN, testN))
Leukemia <- Leukemia %>%
  mutate(HeldOut = sample(HeldOut))
Test <- filter(Leukemia, HeldOut == "yes")
Train <- filter(Leukemia, HeldOut == "no")
head(Test)
##   Age Smear Infil Index Blasts Temp Resp Time Status HeldOut
## 1  28    70    70    14   10.0 1010    1   39      1     yes
## 2  31    72    72     5    2.3  988    1   20      1     yes
## 3  33    42    38    12    2.5  984    1   36      0     yes
## 4  37    71    71    15    4.4 1020    0    1      0     yes
## 5  43    74    63     4    0.1  986    0    2      0     yes
## 6  45    78    47    14    4.2  980    1   33      0     yes

Potential Predictors

We’ll restrict our models to the following predictors: Age, Temp, Smear, Infil, Index, and Blasts. (The remaining variables in the dataset, Time and Status, are outcomes rather than predictors.)

Note: it’s not a good idea to jump into modeling in a domain that you don’t have any background in, especially when some of the variables are quite technical in nature.

I chose this example because it has a little bit of “messiness”, but not too much; that said, you wouldn’t want to do this for a project unless you know something about leukemia, or at least about these measurements.

Checking for transformations

We’ll first check the individual (quantitative) variables’ distributions to see whether any of them obviously call for a transformation.

Age

bwplot(~Age, data = Train) 

histogram(~Age, data = Train)

The Age variable is fairly symmetric.

Temperature

bwplot(~Temp, data = Train)                

histogram(~Temp, data = Train)

The Temp variable is somewhat right-skewed.

bwplot(~sqrt(Temp), data = Train)          

histogram(~sqrt(Temp), data = Train)

Square root didn’t really help.

A log isn’t really sensible here, because temperature isn’t based on a meaningful zero. That is, ratios of degrees Fahrenheit are meaningless.

The square-root transformation has a bigger impact near 0, so let’s shift the values down toward 0 before transforming.

bwplot(~sqrt(Temp - 980), data = Train)    

histogram(~sqrt(Temp - 980), data = Train)

That looks nicer.

Smear

bwplot(~Smear, data = Train)               

histogram(~Smear, data = Train)

This one isn’t too bad.

Infil

bwplot(~Infil, data = Train)               

histogram(~Infil, data = Train)

This one is a bit left-skewed, but not extremely so.

Index

bwplot(~Index, data = Train)

histogram(~Index, data = Train)

Looks pretty reasonable.

Blasts

bwplot(~Blasts, data = Train)

histogram(~Blasts, data = Train)           

That’s a lot of skew. It looks structural, though there may be one genuine outlier.

library(tidyverse)
## Index plot of Blasts against row number, to locate the extreme point
Train %>%
  rownames_to_column() %>%
  mutate(rowname = as.integer(rowname)) %>%
  plot(Blasts ~ rowname, data = .)

## Do not run when Knitting
## with(Train, identify(Blasts))

Let’s check whether that person still looks like an outlier after a log transformation.

However, log(0) is undefined. Are there true zeroes in the data?

min(~Blasts, data = Train)
## [1] 0.3

No true zeroes, so we can take a log.

Note: when a variable does contain true zeroes, it can sometimes make sense to add a small constant to every observation before taking the log, provided ratio comparisons are still the most natural scale for that variable.
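For example, had Blasts contained zeroes, a shifted log along these lines would have worked (hypothetical; the 0.1 shift is arbitrary, chosen only for illustration):

## Hypothetical, since Blasts has no zeroes:
## histogram(~log(Blasts + 0.1), data = Train)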

bwplot(~log(Blasts), data = Train)

histogram(~log(Blasts), data = Train)

That looks a lot better. That point probably isn’t an outlier after all.

Our pool of variables

We have chosen the following transformations:

  * Age –> Age (no transformation)
  * Temp –> sqrt(Temp - 980)
  * Smear –> Smear (no transformation)
  * Infil –> Infil (no transformation)
  * Index –> Index (no transformation)
  * Blasts –> log(Blasts)

Checking for multicollinearity

Next, let’s check whether any of the variables are redundant with the others for the purposes of a linear model.

library(car)
## Since vif() expects a model, we'll fit a model with all predictors,
## only for the purposes of checking multicollinearity
fullModel <- 
  glm(Resp ~ Age + sqrt(Temp - 980) + Smear + Infil + Index + log(Blasts),
     family = "binomial", data = Train)
vif(fullModel)
##              Age sqrt(Temp - 980)            Smear            Infil 
##         1.162497         1.483266         4.896380         5.790113 
##            Index      log(Blasts) 
##         1.573715         2.085865

Smear and Infil are fairly predictable from the other predictors, and Infil exceeds our VIF > 5 guideline. We won’t exclude them outright, because they may no longer be redundant once other variables are removed, but it’s something to keep in mind.
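As a reminder of what this measures: \(VIF_j = 1/(1 - R^2_j)\), where \(R^2_j\) comes from regressing predictor \(j\) on all the others. Here is a rough sketch for Infil (illustrative only; vif() applied to a GLM works from the coefficient covariance matrix, so this lm-based version won’t reproduce the 5.79 exactly):

## Sketch: Infil's VIF reflects how well the other predictors predict it
infil.on.others <- lm(
  Infil ~ Age + sqrt(Temp - 980) + Smear + Index + log(Blasts),
  data = Train)
1 / (1 - summary(infil.on.others)$r.squared)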

Looking at single-predictor models

The main purpose of this step is to assess whether we might want any polynomial terms.

library(arm)
age.model <- glm(Resp ~ Age, family = "binomial", data = Train)
binnedplot(
  fitted(age.model), 
  residuals(age.model, type = "pearson"), 
  nclass = 15)

There might be a bit of curvature there.

temp.model <- glm(
  Resp ~ sqrt(Temp - 980), 
  family = "binomial", 
  data = Train)
binnedplot(
  fitted(temp.model), 
  residuals(temp.model, type = "pearson"), 
  nclass = 15)

This one looks OKish.

smear.model <- glm(
  Resp ~ Smear, 
  family = "binomial", data = Train)
binnedplot(
  fitted(smear.model), 
  residuals(smear.model, type = "pearson"), 
  nclass = 15)

That one is a tough call.

index.model <- glm(
  Resp ~ Index, 
  family = "binomial", 
  data = Train)
binnedplot(
  fitted(index.model), 
  residuals(index.model, type = "pearson"), 
  nclass = 15)

Looks OK.

infil.model <- glm(
  Resp ~ Infil, family = "binomial", data = Train)
binnedplot(
  fitted(infil.model), 
  residuals(infil.model, type = "pearson"), 
  nclass = 15)

Probably OK?

blasts.model <- glm(
  Resp ~ log(Blasts), 
  family = "binomial", 
  data = Train)
binnedplot(
  fitted(blasts.model), 
  residuals(blasts.model, type = "pearson"), 
  nclass = 15)

Looks OK.

So, we may want to consider adding \(Age^2\).

Choosing combinations of predictors to assess

Age and temperature are easy variables to collect, so let’s include those, and then consider which other predictors might be important.

As a quick first check, let’s see if any of them are useful:

baseModel <- 
  glm(Resp ~ Age + I(Age^2) + sqrt(Temp - 980),
      data = Train,
      family = "binomial")
fullModel <- glm(
  Resp ~ Age + I(Age^2) + sqrt(Temp - 980) +
         Smear + Index + Infil + log(Blasts), 
  data = Train, family = "binomial")
anova(baseModel, fullModel, test = "LRT")
## Analysis of Deviance Table
## 
## Model 1: Resp ~ Age + I(Age^2) + sqrt(Temp - 980)
## Model 2: Resp ~ Age + I(Age^2) + sqrt(Temp - 980) + Smear + Index + Infil + 
##     log(Blasts)
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1        37     44.493                          
## 2        33     23.243  4    21.25 0.0002825 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Pretty strong evidence that at least some of those biomarkers are helpful, after controlling for Age and Temperature.

Assessing models using AIC

The AIC is analogous to Mallows’ Cp: a “penalized” measure of how well a model fits a dataset.
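Concretely, for a model with maximized likelihood \(\hat{L}\) and \(k\) estimated coefficients (including the intercept),

\[ AIC = -2 \log \hat{L} + 2k \]

Lower is better: an added predictor must improve the log likelihood by at least 1 to “pay for” its +2 penalty. For an ungrouped binary logistic model, \(-2 \log \hat{L}\) is just the residual deviance, so AIC = residual deviance + 2k.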

Since there are four markers in question, there are sixteen (= 2^4) possible combinations of these markers we could include in our model. We could fit all of these and examine their AIC, but to cut down on the number we consider, let’s first see if there is one of these markers that provides all the information we need on its own.
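For the record, brute-forcing all sixteen subsets takes only a few lines; here is a sketch (not run; the object names are just illustrative):

## Sketch (not run): AIC for every subset of the four markers
markers <- c("Smear", "Infil", "Index", "log(Blasts)")
subsets <- c(list(character(0)),
             unlist(lapply(1:4, combn, x = markers, simplify = FALSE),
                    recursive = FALSE))
all.aics <- sapply(subsets, function(m) {
  rhs <- paste(c("poly(Age,2)", "sqrt(Temp - 980)", m), collapse = " + ")
  AIC(glm(as.formula(paste("Resp ~", rhs)),
          family = "binomial", data = Train))
})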

IndexModel <- glm(
  Resp ~ poly(Age,2) + sqrt(Temp - 980) + Index,
  data = Train, family = "binomial"
)
SmearModel <- glm(
  Resp ~ poly(Age,2) + sqrt(Temp - 980) + Smear,
  data = Train, family = "binomial"
)
InfilModel <- glm(
  Resp ~ poly(Age,2) + sqrt(Temp - 980) + Infil,
  data = Train, family = "binomial"
)
BlastsModel <- glm(
  Resp ~ poly(Age,2) + sqrt(Temp - 980) + log(Blasts),
  data = Train, family = "binomial"
)
AIC(SmearModel)
## [1] 50.91185
AIC(BlastsModel)
## [1] 45.40245
AIC(InfilModel)
## [1] 48.47392
AIC(IndexModel)
## [1] 41.08856
AIC(fullModel)
## [1] 39.24328

The Index marker comes close, but the full model still does a bit better.

Because Index comes so close to capturing everything in the full model, we likely want it included.

Next, let’s fit the models with Index plus one other marker, and see whether any of them achieves an AIC as good as or better than the full model’s.

IndexModelS <- glm(
  Resp ~ poly(Age,2) + sqrt(Temp - 980) + Index + Smear,
  data = Train, family = "binomial"
)
IndexModelI <- glm(
  Resp ~ poly(Age,2) + sqrt(Temp - 980) + Index + Infil,
  data = Train, family = "binomial"
)
IndexModelB <- glm(
  Resp ~ poly(Age,2) + sqrt(Temp - 980) + Index + log(Blasts),
  data = Train, family = "binomial"
)
AIC(IndexModelS)
## [1] 38.84814
AIC(IndexModelI)
## [1] 35.85037
AIC(IndexModelB)
## [1] 39.9999
AIC(fullModel)
## [1] 39.24328

Looks like Index and Infil together give us a potentially better model (at least, when fitting on this dataset) than including all four markers. But so do Index and Smear. What about all three?

ISIModel <- glm(
  Resp ~ poly(Age,2) + sqrt(Temp - 980) + Index + Smear + Infil,
  data = Train, family = "binomial"
)
AIC(IndexModel)
## [1] 41.08856
AIC(IndexModelS)
## [1] 38.84814
AIC(IndexModelI)
## [1] 35.85037
AIC(IndexModelB)
## [1] 39.9999
AIC(ISIModel)
## [1] 37.51595
AIC(fullModel)
## [1] 39.24328

Adding Smear to Index and Infil (AIC 37.5) doesn’t improve on Index and Infil alone (35.9), though both beat the full model. Let’s check the residual plots for these models.

binnedplot(
  fitted(IndexModel),
  residuals(IndexModel, type = "pearson"))

binnedplot(
  fitted(IndexModelS),
  residuals(IndexModelS, type = "pearson"))

binnedplot(
  fitted(IndexModelI),
  residuals(IndexModelI, type = "pearson"))

binnedplot(
  fitted(IndexModelB),
  residuals(IndexModelB, type = "pearson"))

binnedplot(
  fitted(ISIModel),
  residuals(ISIModel, type = "pearson"))

binnedplot(
  fitted(fullModel),
  residuals(fullModel, type = "pearson"))

None of them show particular evidence of curvature.

Checking for possible interactions

One question we might have is whether the relationship between any of the biological markers and the response to treatment is modulated by Age. That is, should we interpret the markers differently depending on the age of the person?

This is equivalent to asking whether we want to include an interaction term between Age and any of the other variables.

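## Note: Blasts enters untransformed below, so this model is not strictly
## nested in fullModel (which uses log(Blasts)); treat the LRT as a rough
## guide rather than an exact test.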
AgeInteractionsModel <- glm(
  Resp ~ (sqrt(Temp - 980) + Index + Infil + Smear + Blasts) * Age + I(Age^2),
  family = "binomial", data = Train)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
anova(fullModel, AgeInteractionsModel, test = "LRT")
## Analysis of Deviance Table
## 
## Model 1: Resp ~ Age + I(Age^2) + sqrt(Temp - 980) + Smear + Index + Infil + 
##     log(Blasts)
## Model 2: Resp ~ (sqrt(Temp - 980) + Index + Infil + Smear + Blasts) * 
##     Age + I(Age^2)
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)  
## 1        33     23.243                       
## 2        28     13.785  5   9.4582  0.09213 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Not much evidence for any interactions there. (The warning that some fitted probabilities are numerically 0 or 1 also hints that this larger model is overfitting the 41 training cases.)

Cross-Validation

We’ve settled on a pool of six models. All six include Temp, Age, Age^2 and Index. Then we also include

  1. Nothing else
  2. Infil
  3. Smear
  4. Blasts
  5. Infil and Smear
  6. Infil, Smear and Blasts

Let’s do cross-validation with these six models.

# source("http://colindawson.net/stat213/code/helper_functions.R")
source("~/213_site/code/helper_functions.R")

Some error metrics that make sense for logistic regression:

  * Normalized deviance: residual deviance (on the validation set) divided by the sample size. We can use normalized.deviance() from helper_functions.R.

  * “Hard” misclassification rate: count a prediction as an “error” if the fitted probability is on the “wrong side” of 0.5, and compute the proportion of validation cases that are “misclassified”. We can use hard.classification.error() from helper_functions.R.

  * False positive rate: what proportion of the true 0s are misclassified as 1s?

  * False negative rate: what proportion of the true 1s are misclassified as 0s?

  * “Soft” misclassification rate: count each point as a “partial error” equal to the predicted probability of the wrong category.

  * Concordance, discordance, and Somers’ D: among all pairs of one actual 1 and one actual 0, concordance is the proportion of pairs in which the 1 received the higher fitted probability, discordance is the proportion in which the 0 did, and Somers’ D is concordance minus discordance. (The soft and concordance metrics are sketched in code below.)
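To make those definitions concrete, here are rough sketches of soft misclassification and concordance (illustrations of the definitions above, not the actual code in helper_functions.R, which may differ in details such as how ties are handled):

## Sketch: soft misclassification rate = average probability assigned
## to the wrong category
soft.error.sketch <- function(model, data) {
  p <- predict(model, newdata = data, type = "response")
  mean(ifelse(data$Resp == 1, 1 - p, p))
}
## Sketch: concordance = proportion of (actual 1, actual 0) pairs in
## which the actual 1 got the higher fitted probability (ties ignored)
concordance.sketch <- function(model, data) {
  p <- predict(model, newdata = data, type = "response")
  mean(outer(p[data$Resp == 1], p[data$Resp == 0], ">"))
}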

We’ll make a list of our six finalists, plus the “base” model (for comparison).

finalists <- list(
  IndexOnly = IndexModel, 
  IndexInfil = IndexModelI,
  IndexSmear = IndexModelS,
  IndexBlasts = IndexModelB,
  IndexSmearInfil = ISIModel,
  AllFour = fullModel,
  None = baseModel)

Now we’ll do 10-fold cross-validation, with 10 random splits for each (to reduce the impact of the choice of folds):
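Under the hood, a single pass of 10-fold CV looks roughly like this (a hand-rolled sketch for one model and one metric; do.k.fold.cv generalizes this and averages the result over several random fold assignments, so details may differ):

## Sketch: one pass of 10-fold CV by hand
folds <- sample(rep(1:10, length.out = nrow(Train)))
fold.errors <- sapply(1:10, function(k) {
  fit <- update(IndexModelI, data = Train[folds != k, ])
  p <- predict(fit, newdata = Train[folds == k, ], type = "response")
  p <- pmin(pmax(p, 1e-12), 1 - 1e-12)  ## guard against log(0)
  y <- Train$Resp[folds == k]
  -2 * mean(y * log(p) + (1 - y) * log(1 - p))  ## held-out normalized deviance
})
mean(fold.errors)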

set.seed(8675309)
n <- nrow(Train)
## Normalized deviances (lower is better)
nds <- sapply(
  finalists, do.k.fold.cv, data = Train, num.folds = 10,
  error.metric = normalized.deviance, iterations = 10)
nds
##       IndexOnly      IndexInfil      IndexSmear     IndexBlasts 
##        1.340329        1.318492        1.337272        1.328550 
## IndexSmearInfil         AllFour            None 
##        1.325797        1.342344        1.432144

All six are relatively close compared to the gap between them and the “base” model.

## hard misclassification rate (lower is better)
hces <- sapply(
  finalists, do.k.fold.cv, data = Train, num.folds = 10,
  error.metric = hard.classification.error, iterations = 10)
hces
##       IndexOnly      IndexInfil      IndexSmear     IndexBlasts 
##          0.4540          0.4250          0.4520          0.4515 
## IndexSmearInfil         AllFour            None 
##          0.4290          0.4190          0.5930

The leader here is the model with all four predictors.

Let’s examine the false-positive and false-negative rates separately, since we presumably don’t care equally about false-positive and false-negative predictions.

fprs <- sapply(
  finalists, do.k.fold.cv, data = Train, num.folds = 10,
  error.metric = false.positive.rate, iterations = 10)
fprs
##       IndexOnly      IndexInfil      IndexSmear     IndexBlasts 
##       0.7901667       0.7573148       0.7491667       0.7802778 
## IndexSmearInfil         AllFour            None 
##       0.7652037       0.7116667       1.0000000

Lots of false positives for all models, but fewest for the model that includes all four markers.

What about false negatives?

fnrs <- sapply(
  finalists, do.k.fold.cv, data = Train, num.folds = 10,
  error.metric = false.negative.rate, iterations = 10)
fnrs
##       IndexOnly      IndexInfil      IndexSmear     IndexBlasts 
##               0               0               0               0 
## IndexSmearInfil         AllFour            None 
##               0               0               0

No false negatives. None of the models have any cases of predicting that someone will fail to respond to treatment who actually does respond.

Seems like the models tend to be “optimistic” about responding to treatment overall.

How about “soft misclassification rate”?

sces <- sapply(
  finalists, do.k.fold.cv, data = Train, num.folds = 10,
  error.metric = soft.classification.error, iterations = 10)
sces
##       IndexOnly      IndexInfil      IndexSmear     IndexBlasts 
##       0.4740260       0.4682146       0.4701644       0.4703431 
## IndexSmearInfil         AllFour            None 
##       0.4695217       0.4715015       0.4982308

Interestingly, while “hard misclassification rate” pushed us toward the full model, weighting our misclassifications using the confidence in the predicted outcome leads us toward the models with Index and Infil, possibly with Smear.

Next, concordance, discordance, and Somers’ D:

## Higher is better
concordances <- 
  sapply(
    finalists, do.k.fold.cv, data = Train, num.folds = 10,
    error.metric = concordance, iterations = 10)
concordances
##       IndexOnly      IndexInfil      IndexSmear     IndexBlasts 
##       0.8785417       0.8705688       0.8776620       0.8544676 
## IndexSmearInfil         AllFour            None 
##       0.8546164       0.7901786       0.7696759
## Lower is better
discordances <- 
  sapply(
    finalists, do.k.fold.cv, data = Train, num.folds = 10,
    error.metric = discordance, iterations = 10)
discordances
##       IndexOnly      IndexInfil      IndexSmear     IndexBlasts 
##       0.1368188       0.1570635       0.1367130       0.1699735 
## IndexSmearInfil         AllFour            None 
##       0.1320999       0.2027083       0.2717758
## Higher is better
somers.ds <-
  sapply(
    finalists, do.k.fold.cv, data = Train, num.folds = 10,
    error.metric = somers.d, iterations = 10)
somers.ds  
##       IndexOnly      IndexInfil      IndexSmear     IndexBlasts 
##       0.6787963       0.7703042       0.6848611       0.7555556 
## IndexSmearInfil         AllFour            None 
##       0.7749405       0.6706481       0.4272884

As with the normalized deviance and the soft misclassification rate, the two models with Index and Infil (with and without Smear) look the best.

Which of these we would prefer depends on context (e.g., how invasive and expensive it is to collect the Smear variable); absent other information, I would tend to opt for the simpler model.

So, the winner is the model with Age, Age^2, Temp, Index and Infil.

Evaluating performance on the test set

For this final model, let’s calculate our various performance measures on the held out test set.

## Hard misclassification rate
test.hce <- get.held.out.error(
  IndexModelI, test.set = Test, 
  response.name = "Resp", 
  error.metric = hard.classification.error)
test.hce
## [1] 0.4
## False positive rate
test.fpr <- get.held.out.error(
    IndexModelI, test.set = Test, 
    response.name = "Resp", error.metric = false.positive.rate)
test.fpr
## [1] 0.3333333
## False negative rate
test.fnrs <- get.held.out.error(
  IndexModelI, test.set = Test, 
  response.name = "Resp", error.metric = false.negative.rate)
test.fnrs
## [1] 0.4285714
## Soft misclassification rate
test.sces <- get.held.out.error(
  IndexModelI, test.set = Test, 
  response.name = "Resp", error.metric = soft.classification.error)
test.sces
## [1] 0.3749122
## Normalized deviance
test.nd <- get.held.out.error(
  IndexModelI, test.set = Test, 
  response.name = "Resp", error.metric = normalized.deviance)
test.nd
## [1] 1.918711
test.somers.d <- get.held.out.error(
  IndexModelI, test.set = Test, 
  response.name = "Resp", error.metric = somers.d)
test.somers.d
## [1] 0.6190476

Fit on the full dataset and interpret

FinalModel <- 
  glm(
    Resp ~ Age + I(Age^2) + sqrt(Temp - 980) + Index + Infil, 
    family = "binomial", data = Leukemia)
FinalModel %>% coef() %>% round(3)
##      (Intercept)              Age         I(Age^2) sqrt(Temp - 980) 
##            6.727           -0.401            0.003           -0.724 
##            Index            Infil 
##            0.401            0.034

Prediction equation:

\[ \widehat{\log(\text{odds})} = 6.7 - 0.4 \cdot Age + 0.003 \cdot Age^2 - 0.72 \cdot \sqrt{Temp - 980} + 0.4 \cdot Index + 0.034 \cdot Infil \]
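To convert a predicted log odds into a probability, we invert the logit:

\[ \hat{\pi} = \frac{e^{\widehat{\log(\text{odds})}}}{1 + e^{\widehat{\log(\text{odds})}}} \]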

Odds and odds ratios:

FinalModel %>% coef() %>% exp() %>% round(5)
##      (Intercept)              Age         I(Age^2) sqrt(Temp - 980) 
##        834.61136          0.66957          1.00342          0.48484 
##            Index            Infil 
##          1.49323          1.03477

It seems that higher Index scores are associated with a higher probability of responding to treatment, after controlling for age, temperature, and percentage infiltrate, with the odds of response increasing by about 49% for each 1-point increase in Index, holding the other predictors fixed.

Similarly, higher Infil scores are associated with a higher probability of responding, after controlling for age, temperature, and Index, with the odds of response increasing by about 3.4% for each 1-point increase in Infil.

It seems that higher body temperatures (on the shifted square-root scale) are associated with a lower chance of responding to treatment, controlling for age and the biomarkers.

For each one-unit increase in \(\sqrt{Temp - 980}\) (with temperature recorded in 10ths of a degree, so 980 corresponds to 98.0F), holding Age, Index, and Infil fixed, the odds of responding to treatment go down by about 52%.
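The square-root scale is hard to think in directly, so a concrete comparison may help: going from a normal 98.6F (Temp = 986) to a 101.0F fever (Temp = 1010) moves \(\sqrt{Temp - 980}\) from about 2.45 to about 5.48, multiplying the odds by roughly:

## Odds factor for a 101.0F fever vs. a normal 98.6F, other predictors fixed
0.48484^(sqrt(1010 - 980) - sqrt(986 - 980))
## about 0.11, i.e., an ~89% reduction in the odds of responding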

For patients around 40 years of age, and adjusting for expected differences in biomarkers, an additional 5 years of age is associated with

odds.factor.45vs40 <- 0.66957^(45-40)*1.00342^(45^2-40^2)
odds.factor.45vs40
## [1] 0.5743148

about a 43% decrease in the odds of responding to treatment.

Around 50 years of age, an additional 5 years of age is associated with only about a 19% decrease in the odds:

odds.factor.55vs50 <- 0.66957^(55-50)*1.00342^(55^2-50^2)
odds.factor.55vs50
## [1] 0.8080259
odds.factor.65vs60 <- 0.66957^(65-60)*1.00342^(65^2-60^2)
odds.factor.65vs60
## [1] 1.136843

However, from 60 to 65, the odds of responding to treatment go back up, after controlling for temperature and biomarkers. This may be because 65-year-olds are expected to have worse biomarker and temperature readings than 60-year-olds, so a 65-year-old who presents with the same indicators as a 60-year-old may be a comparatively robust patient.
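This is consistent with the shape of the fitted quadratic: using the unrounded coefficients (the displayed 0.003 is too coarse for this calculation), the fitted log odds decrease with Age up to the vertex and increase beyond it:

## Age at which the fitted quadratic in Age bottoms out
-coef(FinalModel)["Age"] / (2 * coef(FinalModel)["I(Age^2)"])
## roughly 59 years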

Confidence intervals

Let’s look at the predicted recovery probabilities for a few example patients.

f.hat <- makeFun.logistic(FinalModel)
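makeFun.logistic() comes from helper_functions.R. Roughly speaking, intervals like the ones below can be obtained by building a Wald interval on the logit scale and back-transforming (a sketch under that assumption; the helper’s exact method may differ):

## Sketch: 95% CI for a predicted probability via the logit scale
new.patient <- data.frame(Age = 20, Temp = 986, Index = 13.5, Infil = 72.5)
pred <- predict(FinalModel, newdata = new.patient, type = "link", se.fit = TRUE)
plogis(pred$fit + c(-1.96, 1.96) * pred$se.fit)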

To see what sensible values to plug in for Index and Infil are, let’s look at a scatterplot and five number summaries for these markers.

fivenum(~Index, data = Leukemia)
## [1]  1.0  6.0  9.0 13.5 20.0
fivenum(~Infil, data = Leukemia)
## [1]  8.0 37.5 61.0 72.5 95.0
gf_point(Index ~ Infil, data = Leukemia)

Patient 1a: 20 years old with relatively high index and infil scores, and a normal body temperature (favorable scenario)

f.hat(Age = 20, Temp = 986, Index = 13.5, Infil = 72.5, interval = "confidence")
##   Age Temp Index Infil    pi.hat       lwr       upr
## 1  20  986  13.5  72.5 0.9979536 0.9066558 0.9999592

We estimate that a young patient with high index and infil scores and a normal body temp has at least a 90% chance of responding to treatment.

Patient 1b: same as Patient 1a but with a fever of 101F

f.hat(Age = 20, Temp = 1010, Index = 13.5, Infil = 72.5, interval = "confidence")
##   Age Temp Index Infil    pi.hat       lwr       upr
## 1  20 1010  13.5  72.5 0.9819743 0.6716373 0.9993112

If a patient like this has a fever, the probability of responding might well still be near 100%, but it might be down to something like 67%.

Patient 2a: 50 years old, moderate index and infil scores (near the median), normal body temp

f.hat(Age = 50, Temp = 986, Index = 9, Infil = 61, interval = "confidence")
##   Age Temp Index Infil    pi.hat       lwr       upr
## 1  50  986     9    61 0.2970462 0.1061308 0.6006274

Patients like this might have anywhere between a 10% and 60% chance of responding.

Patient 2b: Same as 2a but with a fever of 101F

f.hat(Age = 50, Temp = 1010, Index = 9, Infil = 61, interval = "confidence")
##   Age Temp Index Infil     pi.hat         lwr       upr
## 1  50 1010     9    61 0.04507679 0.004957011 0.3090543

The presence of a fever of 101F worsens the outlook for this patient considerably. The probability of responding goes down to somewhere between 0.5% and 31%.

Patient 3: 75 years old, fever of 101F, low index and infil (unfavorable scenario)

f.hat(Age = 75, Temp = 1010, Index = 6, Infil = 37.5, interval = "confidence")
##   Age Temp Index Infil     pi.hat          lwr       upr
## 1  75 1010     6  37.5 0.01209107 0.0004113198 0.2668784

Even if our model is being pessimistic, we doubt that the probability of responding to treatment is higher than about 27% for a patient like this, and the point estimate is closer to 1%.