We are trying to select a model to predict Resp (will a Leukemia patient respond to treatment? 1 = yes, 0 = no) from a set of diagnostic and vitals results.
Before we do any modeling, let’s set aside a test set.
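The split code isn't shown here; as a minimal sketch, assuming the full data frame is called Leukemia (as it is later in this document) and holding out 10 of the 51 cases (the sizes implied by later output):
## Sketch of a train/test split; the actual code and seed aren't shown
set.seed(213)  ## hypothetical seed
test_rows <- sample(nrow(Leukemia), size = 10)
TestSet  <- Leukemia[test_rows, ]
TrainSet <- Leukemia[-test_rows, ]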
We’ll restrict our models to the following predictors:
Age
(age in years)Temp
(highest temperature in 10 * degrees F)Smear
(differential percentage of “blasts”)Infil
(percent of marrow leukemia infiltrate)Index
(percentage labeling index of bone marrow leukemia cells)Blasts
(absolute number of blasts in thousands)Note: it’s not a good idea to jump into modeling in a domain that you don’t have any background in, especially when some of the variables are quite technical in nature.
I chose this example because it has a little bit of “messiness”, but not too much; that said, you wouldn’t want to do this for a project unless you know about Leukemia, or at least about these metrics.
We’ll check individual (quantitative) variables’ distributions first to see whether any obviously call for a transformation
The Age variable is fairly symmetric.
The Temp variable is somewhat right-skewed. However, due to the nature of the scale (no natural "zero point"), the usual transformations we might use with a right-skewed variable don't make a lot of sense: a log isn't sensible, because ratios of degrees Fahrenheit are meaningless. On balance I think it's probably better to leave Temp as is, because any transformation would be somewhat arbitrary and would sacrifice interpretability.
The Smear distribution isn't too bad. Infil is also pretty symmetric. Index looks pretty symmetric as well.
That’s a lot of skew. It looks structural, though there may be one genuine outlier.
Let’s remind ourselves what this variable is measuring:
It’s a count, so ratios make sense. We could consider a log transformation, but that will produce infinite values if there are any zeroes. Let’s check (I’m going to use the full dataset here, because if we do a log transform and then there are zeroes in the test set, it will break).
## [1] 0.00 1.00 2.60 9.95 115.00
## (Blasts == 0)
## TRUE FALSE
## 1 50
There’s one zero in the data, so we need to account for this if we take a log.
One thing we could do is add a small count to each value before taking the log. The variable is measured in thousands, and appears to be recorded to one decimal place, which means the counts are rounded to the nearest hundred. So let's add 0.05 (the equivalent of 50 blasts), which is the amount of potential "rounding error" inherent in each measurement.
The base of the log doesn’t affect anything except the scaling of the result, so I’ll use base 2 so that one unit represents a doubling of the count, which seems like a natural thing to think about with cells that are dividing.
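A histogram of the transformed variable might be produced like this (a sketch, assuming ggplot2; the original plotting code isn't shown):
library(ggplot2)
ggplot(TrainSet, aes(x = log2(Blasts + 0.05))) +
  geom_histogram(bins = 15)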
That looks a lot better. That high value probably isn’t an outlier after all, but rather just a representative of a long right tail in the distribution.
All of the candidate predictors had pretty symmetric distributions, with the exception of Temp, which we elected not to transform because there wasn't a natural transformation to use (and the skew wasn't all that bad to begin with), and Blasts, which we decided to transform via a base-2 log, after adding a count equal to the amount of potential rounding error in the original variable.
Next, let’s check whether any of the variables are redundant with the others.
library(car)
## Since vif() expects a model, we'll fit a model with all predictors,
## only for the purposes of checking multicollinearity
fullModel <-
glm(Resp ~ Age + Temp + Smear + Infil + Index + log2(Blasts + 0.05),
family = "binomial", data = TrainSet)
vif(fullModel)
## Age Temp Smear
## 1.306507 1.474731 3.136034
## Infil Index log2(Blasts + 0.05)
## 3.443578 1.961914 1.797618
Smear and Infil are fairly predictable from the others, but neither exceeds our VIF > 5 guideline. We won’t preemptively remove anything, but it’s worth keeping in mind that we have some overlap in information.
Age and Temp are easy variables to collect, so let’s start with a baseline model that uses those and none of the biomarker variables, and then consider which of the biomarkers are the most important to include.
As a quick first check, let’s see if any of them are useful:
baseModel <- glm(
formula = Resp ~ Age + Temp,
family = "binomial",
data = TrainSet)
fullModel <- glm(
formula = Resp ~ Age + Temp + Smear + Index + Infil + log2(Blasts + 0.05),
family = "binomial",
data = TrainSet)
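The deviance table below compares these two nested models with a likelihood-ratio test, presumably via the same kind of call used later in this document:
anova(baseModel, fullModel, test = "LRT")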
## Analysis of Deviance Table
##
## Model 1: Resp ~ Age + Temp
## Model 2: Resp ~ Age + Temp + Smear + Index + Infil + log2(Blasts + 0.05)
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 38 49.959
## 2 34 31.711 4 18.248 0.001104 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
There’s fairly strong evidence that at least some of those biomarkers are helpful, after controlling for Age
and Temp
.
Squared deviations aren’t something we use with a binary response, so \(R^2\) isn’t a relevant measure.
Instead, we can look at the AIC ("Akaike Information Criterion"), which is analogous to Mallows' \(C_p\): a measure of how well a model fits a dataset, based on the log likelihood, with a penalty for complexity. Like Mallows' \(C_p\), smaller values are better.
Since there are four markers in question, there are sixteen (= 2^4) possible combinations of these markers we could include in our model. We could fit all of these and examine their AIC, but to cut down on the number we consider, let’s first see if there is one of these markers that provides most of the information we need on its own.
indexModel <- glm(
formula = Resp ~ Age + Temp + Index,
family = "binomial",
data = TrainSet)
smearModel <- glm(
formula = Resp ~ Age + Temp + Smear,
family = "binomial", data = TrainSet)
infilModel <- glm(
formula = Resp ~ Age + Temp + Infil,
family = "binomial", data = TrainSet)
blastsModel <- glm(
formula = Resp ~ Age + Temp + log2(Blasts + 0.05),
family = "binomial",
data = TrainSet)
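Each unlabeled value below (and in the similar listings that follow) is presumably an individual AIC() call; the original calls and their order aren't shown, but judging by the values that recur across listings, 43.68 belongs to indexModel and 45.71 to fullModel. A sketch of the kind of call involved:
AIC(indexModel)  ## one call like this per printed value; order is a guess
AIC(fullModel)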
## [1] 54.75367
## [1] 51.05601
## [1] 53.08415
## [1] 43.67812
## [1] 45.71057
Including the Index marker and no other biomarkers actually gives us a better AIC than the one we get using all of the biomarkers.
Because Index yields such a good AIC all by itself, we likely want it included.
Next, let’s fit the various models with Index and one other marker, and see if any of these can improve on the AIC we get from the index-only model.
indexSmearModel <- glm(
formula = Resp ~ Age + Temp + Index + Smear,
family = "binomial",
data = TrainSet)
indexInfilModel <- glm(
formula = Resp ~ Age + Temp + Index + Infil,
family = "binomial",
data = TrainSet)
indexBlastModel <- glm(
formula = Resp ~ Age + Temp + Index + log2(Blasts + 0.05),
family = "binomial",
data = TrainSet)
## [1] 42.55728
## [1] 41.85169
## [1] 44.95675
## [1] 43.67812
## [1] 45.71057
Looks like Index and Infil together give us a potentially better model (at least, when fitting on this dataset) than including all four markers. So do Index and Smear together. What about all three?
indexSmearInfilModel <- glm(
formula = Resp ~ Age + Temp + Index + Smear + Infil,
family = "binomial",
data = TrainSet)
## [1] 43.67812
## [1] 42.55728
## [1] 41.85169
## [1] 44.95675
## [1] 43.7402
## [1] 45.71057
Including all three yields a higher AIC than including Index and either one of the others. This is consistent with our previous observation that Smear and Infil are somewhat multicollinear with the other predictors, so it may not be worthwhile to include both together. However, all of these models give lower AICs than the full model.
Let’s check the residual plots for these models.
library(arm)  ## binnedplot() comes from the arm package
binnedplot(
  x = fitted(indexSmearInfilModel),
  y = residuals(indexSmearInfilModel, type = "pearson"))
None of them show particular evidence of curvature.
One question we might have is whether the relationship between any of the biological markers and the response to treatment is different for patients of different ages. That is, should we interpret the markers differently depending on the age of the person?
This is equivalent to asking whether we want to include an interaction term between Age and any of the other variables.
ageInteractionsModel <- glm(
formula = Resp ~ Temp + (Index + Infil + Smear + Blasts) * Age,
family = "binomial",
data = TrainSet)
anova(fullModel, ageInteractionsModel, test = "LRT")
## Analysis of Deviance Table
##
## Model 1: Resp ~ Age + Temp + Smear + Index + Infil + log2(Blasts + 0.05)
## Model 2: Resp ~ Temp + (Index + Infil + Smear + Blasts) * Age
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 34 31.711
## 2 30 26.592 4 5.1189 0.2753
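The AIC value printed below presumably comes from:
AIC(ageInteractionsModel)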
## [1] 48.59171
There’s not much evidence that including interaction terms with Age
helps the fit, and the AIC is worse.
We’ve settled on a pool of six models. All six include Temp
, Age
and Index
. Then we also include
Infil
Smear
Blasts
Infil
and Smear
Infil
, Smear
and Blasts
Let’s do cross-validation with these six models.
# source("http://colindawson.net/stat213/code/helper_functions.R")
source("~/www/colinreimerdawson/stat213/code/helper_functions.R")
Some error metrics that make sense for logistic regression:

- "Hard" misclassification rate: count a prediction as an "error" if the fitted probability is on the "wrong side" of 0.5, and compute the proportion of validation cases that are "misclassified". (hard_classification_error() from helper_functions.R)
- False positive rate: what proportion of the true 0s are misclassified as 1s? (This ignores the true 1s.) (false_positive_rate())
- False negative rate: what proportion of the true 1s are misclassified as 0s? (Ignores the true 0s.) (false_negative_rate())
- "Soft" misclassification rate: count each point as a "partial error" according to the predicted probability assigned to the opposite category. (soft_classification_error())
- Normalized deviance: residual deviance (on the validation set), divided by sample size (i.e., "per point" deviance). (normalized_deviance())
- Concordance, discordance, and Somers' D: considering all pairs consisting of one true 1 and one true 0, concordance is the proportion of pairs in which the model assigns the higher predicted probability to the true 1, discordance is the proportion in which the true 0 gets the higher probability, and Somers' D is concordance minus discordance (see the sketch below). (concordance(), discordance(), somers_d())
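As a sketch of the pairwise idea (an assumed implementation, not necessarily what helper_functions.R actually does):
## Assumed sketch: proportion of (true 1, true 0) pairs in which the
## true 1 receives the higher fitted probability (ties ignored here)
concordance_sketch <- function(actual, fitted_probs) {
  p1 <- fitted_probs[actual == 1]
  p0 <- fitted_probs[actual == 0]
  mean(outer(p1, p0, FUN = ">"))
}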
We’ll make a list of our six finalists, plus the “base” model (for comparison).
finalists <- list(
IndexOnly = indexModel,
IndexInfil = indexInfilModel,
IndexSmear = indexSmearModel,
IndexBlasts = indexBlastModel,
IndexSmearInfil = indexSmearInfilModel,
AllFour = fullModel,
None = baseModel)
formulaList <- sapply(finalists, formula)
Now we’ll do 10-fold cross-validation, with 10 random splits for each (to reduce the impact of the choice of folds):
set.seed(8675309)
normalized_deviances <- sapply(
X = formulaList,
FUN = do_k_fold_cv,
dataset = TrainSet,
K = 10,
error_function = normalized_deviance,
iterations = 10,
family = "binomial")
normalized_deviances
## IndexOnly IndexInfil IndexSmear IndexBlasts
## 1.177679 1.231895 1.172935 1.197182
## IndexSmearInfil AllFour None
## 1.344194 1.400089 1.394247
The four smallest models (IndexOnly, IndexInfil, IndexSmear, IndexBlasts) are fairly close to one another and clearly better than the "base" model; IndexSmearInfil and AllFour do noticeably worse, with AllFour essentially tied with the base model. IndexSmear has the lowest normalized deviance.
set.seed(8675309)
## hard misclassification rate (lower is better)
hard_misclassification_rates <- sapply(
X = formulaList,
FUN = do_k_fold_cv,
dataset = TrainSet,
K = 10,
error_function = hard_classification_error,
iterations = 10,
family = "binomial")
hard_misclassification_rates
## IndexOnly IndexInfil IndexSmear IndexBlasts
## 0.2605 0.2660 0.2370 0.2615
## IndexSmearInfil AllFour None
## 0.2770 0.2895 0.3450
The model with only Index and Smear again has the best result.
Let’s examine the false-positive and false-negative rates separately, since if we were using this model to decide whether treatment was worthwhile, deciding to treat and then getting no response presumably has a different “cost” than deciding not to treat if there would have been a response.
set.seed(8675309)
false_positive_rates <- sapply(
X = formulaList,
FUN = do_k_fold_cv,
dataset = TrainSet,
K = 10,
error_function = false_positive_rate,
iterations = 10,
family = "binomial")
false_positive_rates
## IndexOnly IndexInfil IndexSmear IndexBlasts
## 0.2524769 0.2625000 0.2652778 0.2391667
## IndexSmearInfil AllFour None
## 0.2433333 0.2921296 0.3678704
This time, the model with Index and Blasts gives the best result.
What about false negatives?
set.seed(8675309)
false_negative_rates <- sapply(
X = formulaList,
FUN = do_k_fold_cv,
dataset = TrainSet,
K = 10,
error_function = false_negative_rate,
iterations = 10,
family = "binomial")
false_negative_rates
## IndexOnly IndexInfil IndexSmear IndexBlasts
## 0.2799769 0.2610648 0.2258102 0.2690741
## IndexSmearInfil AllFour None
## 0.2906944 0.3001852 0.3103704
Index and Smear yields the lowest false negative rate, however.
How about “soft misclassification rate”?
set.seed(8675309)
soft_misclassification_rates <- sapply(
X = formulaList,
FUN = do_k_fold_cv,
dataset = TrainSet,
K = 10,
error_function = soft_classification_error,
iterations = 10,
family = "binomial")
soft_misclassification_rates
## IndexOnly IndexInfil IndexSmear IndexBlasts
## 0.3242208 0.3043737 0.3087972 0.3235381
## IndexSmearInfil AllFour None
## 0.3183290 0.3318499 0.4484535
Index and Infil gives a slightly lower soft misclassification rate than Index and Smear, but it's really close.
set.seed(8675309)
## Higher is better
concordances <- sapply(
X = formulaList,
FUN = do_k_fold_cv,
dataset = TrainSet,
K = 10,
error_function = concordance,
iterations = 10,
family = "binomial")
concordances
## IndexOnly IndexInfil IndexSmear IndexBlasts
## 0.8358631 0.8699074 0.8256481 0.8372917
## IndexSmearInfil AllFour None
## 0.8335417 0.8068948 0.7021329
The model with Index and Infil performs the best on this measure.
set.seed(8675309)
## Lower is better
discordances <- sapply(
X = formulaList,
FUN = do_k_fold_cv,
dataset = TrainSet,
K = 10,
error_function = discordance,
iterations = 10,
family = "binomial")
discordances
## IndexOnly IndexInfil IndexSmear IndexBlasts
## 0.1641369 0.1300926 0.1743519 0.1627083
## IndexSmearInfil AllFour None
## 0.1664583 0.1931052 0.2978671
And on this one as well.
set.seed(8675309)
## Higher is better
somers_Ds <- sapply(
X = formulaList,
FUN = do_k_fold_cv,
dataset = TrainSet,
K = 10,
error_function = somers_d,
iterations = 10,
family = "binomial")
somers_Ds
## IndexOnly IndexInfil IndexSmear IndexBlasts
## 0.6717262 0.7398148 0.6512963 0.6745833
## IndexSmearInfil AllFour None
## 0.6670833 0.6137897 0.4042659
Since Somer’s d is calculated by combining concordance and discordance, it makes sense that the model that performed best on both of these separately performs the best here.
Other than the false positive rate, which was minimized by including Blasts, two models consistently perform the best: Index and Infil, and Index and Smear. Which of these we would prefer depends on context: e.g., how invasive and expensive the procedure to collect the Smear variable is compared to the Infil variable, how heavily we want to weigh false negatives vs. being able to "rank" response probabilities, etc.
Whichever model we decide to go with, we should calculate our various performance measures on the held-out test set. Here, let's go with the Index and Infil model.
metric_list <- list(
"Hard Misclassification" = hard_classification_error,
"Soft Misclassification" = soft_classification_error,
"Normalized Deviance" = normalized_deviance,
"False Positive Rate" = false_positive_rate,
"False Negative Rate" = false_negative_rate,
"Concordance" = concordance,
"Discordance" = discordance,
"Somer's D" = somers_d)
test_metrics <- sapply(
X = metric_list,
FUN = held_out_error,
test_dataset = TestSet,
fitted_model = indexInfilModel)
test_metrics
## Hard Misclassification Soft Misclassification Normalized Deviance
## 0.2000000 0.2847369 0.8863534
## False Positive Rate False Negative Rate Concordance
## 0.2000000 0.2000000 0.8000000
## Discordance Somers' D
## 0.2000000 0.6000000
Finally, we refit the chosen model on the full dataset (training and test sets combined) to get our final coefficient estimates.
finalModel <- glm(
  formula = Resp ~ Age + Temp + Index + Infil,
  family = "binomial",
  data = Leukemia)
finalModel %>%
coef() %>%
round(3)
## (Intercept) Age Temp Index Infil
## 95.568 -0.060 -0.099 0.407 0.034
\[ \widehat{LogOdds(Resp = 1)} = 95.568 - 0.060\cdot Age - 0.099 \cdot Temp + 0.407 \cdot Index + 0.034 \cdot Infil\]
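The odds ratios below come from exponentiating the coefficients, presumably via something like:
finalModel %>%
  coef() %>%
  exp() %>%
  round(3)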
## (Intercept) Age Temp Index Infil
## 3.19528e+41 9.42000e-01 9.05000e-01 1.50200e+00 1.03500e+00
It seems that each year of Age is associated with about a 5.8% decrease in the odds of responding to treatment, controlling for temperature and biomarkers.
Patients with higher temperature readings have lower odds of responding, about 9.5% lower for each tenth of a degree Fahrenheit.
Higher Index scores are associated with a higher probability of responding to treatment, after controlling for age, temperature, and percentage infiltrate: the odds of response increase by about 50% with each 1-point increase in Index. Similarly, higher Infil scores are associated with a higher probability of responding, after controlling for the other predictors, with the odds of response increasing by about 3.5% for each 1-point difference.
Let’s look at the predicted recovery probabilities for a few example patients.
To see what sensible values to plug in for Index and Infil are, let's look at a scatterplot and five-number summaries for these markers.
## [1] 1.0 6.0 9.0 13.5 20.0
## [1] 8.0 37.5 61.0 72.5 95.0
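The prediction tables below report a point estimate (pi.hat) and a confidence interval (presumably 95%) for the response probability. The helper that produced them isn't shown; a minimal sketch of the standard approach (build the interval on the log-odds scale, then invert the logit) might look like this, where predict_with_ci is a hypothetical name:
## Hypothetical helper: interval built on the link (log-odds) scale,
## then transformed to the probability scale with plogis()
predict_with_ci <- function(model, newdata, level = 0.95) {
  z <- qnorm(1 - (1 - level) / 2)
  pred <- predict(model, newdata = newdata, type = "link", se.fit = TRUE)
  data.frame(
    newdata,
    pi.hat = plogis(pred$fit),
    lwr = plogis(pred$fit - z * pred$se.fit),
    upr = plogis(pred$fit + z * pred$se.fit))
}
predict_with_ci(
  finalModel,
  newdata = data.frame(Age = 20, Temp = 986, Index = 13.5, Infil = 72.5))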
Patient 1a: 20 years old with relatively high index and infil scores, and a normal body temperature (favorable scenario)
## Age Temp Index Infil pi.hat lwr upr
## 1 20 986 13.5 72.5 0.9863171 0.8388001 0.9989996
We estimate that a patient with high Index and Infil scores and a normal body temperature has at least an 83% chance of responding to treatment (the lower endpoint of the confidence interval).
Patient 1b: same as Patient 1a but with a fever of 101F
## Age Temp Index Infil pi.hat lwr upr
## 1 20 1010 13.5 72.5 0.8688945 0.5108812 0.9767723
If a patient like this has a fever, the probability of responding might still be as high as 97 or 98%, but it might be down to something like 51%.
Patient 2a: 50 years old, moderate index and infil scores (near the median), normal body temp
## Age Temp Index Infil pi.hat lwr upr
## 1 50 986 9 61 0.5615345 0.3450743 0.756861
Patients like this might have anywhere between a 34% and 75% chance of responding; our best guess is that their response chance is around 56%.
Patient 2b: Same as 2a but with a fever of 101F
## Age Temp Index Infil pi.hat lwr upr
## 1 50 1010 9 61 0.1053432 0.02034128 0.40038
The presence of a fever of 101F worsens the outlook for this patient considerably. The probability of responding may be as low as 2%, and our best guess is that it’s around 10%.
Patient 3: 75 years old, fever of 101F, low index and infil (unfavorable scenario)
## Age Temp Index Infil pi.hat lwr upr
## 1 75 1010 6 37.5 0.003443554 9.905594e-05 0.1075632
Even if our model is pessimistic, we doubt that the probability of responding to treatment is higher than 10% or so for a patient like this, and more likely it’s less than 1%.