Splitting into Train and Test

We are trying to select a model to predict Resp (whether a leukemia patient will respond to treatment; 1 = yes, 0 = no) from a set of diagnostic and vitals measurements.

Before we do any modeling, let’s set aside a test set.
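The code for the split isn’t shown here; below is a minimal sketch of what it might look like, assuming the data frame is called leuk (a name I’m assuming). The residual degrees of freedom in the deviance tables later imply 41 training cases out of the 51 patients, so roughly 10 cases are held out.

```r
set.seed(2024)  # hypothetical seed, for reproducibility
test_idx   <- sample(nrow(leuk), size = 10)  # hold out ~20% (10 of 51) as a test set
leuk_test  <- leuk[test_idx, ]
leuk_train <- leuk[-test_idx, ]
```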

Potential Predictors

We’ll restrict our models to the following predictors:

  • Age (age in years)
  • Temp (highest temperature, recorded as degrees Fahrenheit × 10, so 986 = 98.6°F)
  • Smear (differential percentage of “blasts”)
  • Infil (percent of marrow leukemia infiltrate)
  • Index (percentage labeling index of bone marrow leukemia cells)
  • Blasts (absolute number of blasts in thousands)

Note: it’s not a good idea to jump into modeling in a domain in which you have no background, especially when some of the variables are quite technical in nature.

I chose this example because it has a little bit of “messiness”, but not too much; that said, you wouldn’t want to do this for a real project unless you knew about leukemia, or at least about these metrics.

Checking for potential transformations

We’ll first check the individual (quantitative) variables’ distributions to see whether any obviously call for a transformation.

Temperature

The Temp variable is somewhat right-skewed. However, because of the nature of the scale (there is no natural “zero point”), the usual transformations for a right-skewed variable don’t make much sense here: a log isn’t sensible, because ratios of degrees Fahrenheit are meaningless.

On balance, I think it’s better to leave Temp as is: any transformation would be somewhat arbitrary and would sacrifice interpretability.

Blasts

That’s a lot of skew. It looks structural, though there may be one genuine outlier.

Let’s remind ourselves what this variable is measuring: the absolute number of blasts, in thousands.

It’s a count, so ratios make sense, and a log transformation is a natural candidate. But a log produces \(-\infty\) for zero counts, so let’s check for zeroes. (I’m going to use the full dataset here, because if we adopted a log transform and the test set turned out to contain zeroes, prediction would break.)

## [1]   0.00   1.00   2.60   9.95 115.00
## (Blasts == 0)
##  TRUE FALSE 
##     1    50

There’s one zero in the data, so we need to account for this if we take a log.

One thing we could do is add a small count to each value before taking the log. The variable is measured in thousands and appears to be recorded to one decimal place, which means the counts are rounded to the nearest hundred. So let’s add 0.05, the equivalent of 50 blasts, which is the maximum “rounding error” inherent in each recorded measurement.

The base of the log doesn’t affect anything except the scaling of the result, so I’ll use base 2 so that one unit represents a doubling of the count, which seems like a natural thing to think about with cells that are dividing.
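A sketch of the transformation and a re-check of the shape (again assuming the training data frame is leuk_train):

```r
# 0.05 thousand = 50 blasts, the maximum rounding error in a value
# recorded to the nearest 0.1 thousand
hist(log2(leuk_train$Blasts + 0.05),
     main = "log2(Blasts + 0.05)", xlab = "log2 blasts (thousands)")
```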

That looks a lot better. That high value probably isn’t an outlier after all, but rather just a representative of a long right tail in the distribution.

Our pool of variables

All of the candidate predictors had fairly symmetric distributions, with two exceptions: Temp, which we elected not to transform because there was no natural transformation available (and the skew wasn’t all that bad to begin with), and Blasts, which we decided to transform via a base-2 log after adding a count equal to the potential rounding error in the original variable.

  • Age → Age (no transformation)
  • Temp → Temp (no transformation)
  • Smear → Smear (no transformation)
  • Infil → Infil (no transformation)
  • Index → Index (no transformation)
  • Blasts → log2(Blasts + 0.05)

Checking for multicollinearity

Next, let’s compute variance inflation factors (VIFs) to check whether any of the predictors are redundant with the others.

##                 Age                Temp               Smear 
##            1.306507            1.474731            3.136034 
##               Infil               Index log2(Blasts + 0.05) 
##            3.443578            1.961914            1.797618

Smear and Infil are fairly predictable from the others, but neither exceeds our VIF > 5 guideline. We won’t preemptively remove anything, but it’s worth keeping in mind that we have some overlap in information.
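The code that produced these values isn’t shown, but generalized VIFs like this can be obtained by applying vif() from the car package to the fitted full model; a sketch:

```r
library(car)  # provides vif(); applied to a glm it returns generalized VIFs

full_fit <- glm(Resp ~ Age + Temp + Smear + Infil + Index + log2(Blasts + 0.05),
                family = binomial, data = leuk_train)
vif(full_fit)  # values near 1 indicate little overlap with the other predictors
```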

Choosing combinations of predictors to assess

Age and Temp are easy variables to collect, so let’s start with a baseline model that uses those and none of the biomarker variables, and then consider which of the biomarkers are the most important to include.

As a quick first check, let’s see if any of them are useful:

## Analysis of Deviance Table
## 
## Model 1: Resp ~ Age + Temp
## Model 2: Resp ~ Age + Temp + Smear + Index + Infil + log2(Blasts + 0.05)
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)   
## 1        38     49.959                        
## 2        34     31.711  4   18.248 0.001104 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

There’s fairly strong evidence that at least some of those biomarkers are helpful, after controlling for Age and Temp.
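For reference, a comparison like the one above is a drop-in-deviance (likelihood ratio) test, which anova() performs on nested glm fits; a sketch, assuming the training data frame is leuk_train:

```r
base_fit <- glm(Resp ~ Age + Temp, family = binomial, data = leuk_train)
full_fit <- glm(Resp ~ Age + Temp + Smear + Index + Infil + log2(Blasts + 0.05),
                family = binomial, data = leuk_train)
anova(base_fit, full_fit, test = "Chisq")  # drop in deviance vs. a chi-squared on 4 df
```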

Assessing models using AIC

Squared deviations aren’t something we use with a binary response, so \(R^2\) isn’t a relevant measure.

Instead, we can look at AIC, which stands for “Akaike Information Criterion”: a measure of how well a model fits a dataset, based on the log likelihood, with a penalty for complexity. It plays a role analogous to Mallows’ \(C_p\) in linear regression, and as with Mallows’ \(C_p\), smaller values are better.
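Concretely, for a model with \(p\) estimated parameters and maximized likelihood \(\hat{L}\),

\[ AIC = -2 \ln \hat{L} + 2p, \]

where the first term measures (lack of) fit via the log likelihood and the second term is the complexity penalty.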

Since there are four markers in question, there are \(2^4 = 16\) possible combinations of these markers we could include in our model. We could fit all sixteen and examine their AICs, but to cut down on the number we consider, let’s first see whether one of these markers provides most of the information we need on its own.

## [1] 54.75367
## [1] 51.05601
## [1] 53.08415
## [1] 43.67812
## [1] 45.71057

Including the Index marker and no other biomarkers actually gives us a better (lower) AIC than the one we get using all of the biomarkers.

Because Index yields such a good AIC all by itself, we likely want it included.
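For reference, the two AICs singled out above can be computed directly (the single-marker fits for Smear, Infil, and Blasts are analogous):

```r
index_fit <- glm(Resp ~ Age + Temp + Index,
                 family = binomial, data = leuk_train)
all_fit   <- glm(Resp ~ Age + Temp + Smear + Index + Infil + log2(Blasts + 0.05),
                 family = binomial, data = leuk_train)
AIC(index_fit)  # 43.68 in the output above
AIC(all_fit)    # 45.71
```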

Next, let’s fit the models with Index and one other marker, and see whether any of these improves on the AIC we get from the Index-only model.

## [1] 42.55728
## [1] 41.85169
## [1] 44.95675
## [1] 43.67812
## [1] 45.71057

Looks like Index and Infil together give us a potentially better model (at least, when fitting on this dataset) than including all four markers; so do Index and Smear. What about all three?

## [1] 43.67812
## [1] 42.55728
## [1] 41.85169
## [1] 44.95675
## [1] 43.7402
## [1] 45.71057

Including all three yields a higher AIC than including Index and either one of the others. This is consistent with our earlier observation that Smear and Infil are somewhat collinear with the other predictors, so it may not be worthwhile to include both together. However, all of these models give lower AICs than the full model.

Let’s check the residual plots for these models.

None of them show particular evidence of curvature.

Checking for possible interactions

One question we might have is whether the relationship between any of the biological markers and the response to treatment is different for patients of different ages. That is, should we interpret the markers differently depending on the age of the person?

This is equivalent to asking whether we want to include an interaction term between Age and any of the other variables.

## Analysis of Deviance Table
## 
## Model 1: Resp ~ Age + Temp + Smear + Index + Infil + log2(Blasts + 0.05)
## Model 2: Resp ~ Temp + (Index + Infil + Smear + Blasts) * Age
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1        34     31.711                     
## 2        30     26.592  4   5.1189   0.2753
## [1] 48.59171

There’s not much evidence that including interaction terms with Age helps the fit, and the AIC is worse.
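For reference, a sketch of the comparison above, mirroring the printed formulas (note that, as printed, the interaction model uses raw Blasts rather than the log version):

```r
main_fit <- glm(Resp ~ Age + Temp + Smear + Index + Infil + log2(Blasts + 0.05),
                family = binomial, data = leuk_train)
int_fit  <- glm(Resp ~ Temp + (Index + Infil + Smear + Blasts) * Age,
                family = binomial, data = leuk_train)
anova(main_fit, int_fit, test = "Chisq")
AIC(int_fit)  # 48.59 above: worse than our two-marker candidates
```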

Cross-Validation

We’ve settled on a pool of six models. All six include Temp, Age, and Index. Beyond that, we include:

  1. Nothing else
  2. Infil
  3. Smear
  4. Blasts
  5. Infil and Smear
  6. Infil, Smear and Blasts

Let’s do cross-validation with these six models.

Some error metrics that make sense for logistic regression:

“Hard” misclassification rate: Count a prediction as an “error” if its fitted probability is on the “wrong side” of 0.5, and compute the proportion of validation cases that are misclassified.

  • Can use hard_classification_error() from helper_functions.R

False positive rate: What proportion of the true 0s are misclassified as 1s? (This ignores the true 1s.)

  • Use false_positive_rate()

False negative rate: What proportion of the true 1s are misclassified as 0s? (This ignores the true 0s.)

  • Use false_negative_rate()

“Soft” misclassification rate: Count each point as a “partial error” equal to the predicted probability assigned to the opposite category.

  • Rewards confident correct predictions and penalizes confident wrong predictions, with “uncertain” predictions (near 0.5) in between.
  • Computed as the average error per case in the validation set
  • Use soft_classification_error()

Normalized deviance: Residual deviance (on the validation set) divided by the sample size (i.e., “per point” deviance).

  • Use normalized_deviance()

Concordance, discordance, and Somers’ D:

  • These measures consider all pairs of cases with distinct response values (that is, one 0 and one 1).
  • The “concordance” is the proportion of such pairs in which the 1 is assigned a higher predicted probability than the 0 (good).
  • The “discordance” is the proportion of pairs in which the opposite occurs (bad).
  • These need not add to 1, since pairs with equal predicted probabilities count toward neither measure.
  • Somers’ D := Concordance - Discordance (equivalently, the number of concordant pairs minus the number of discordant pairs, divided by the total number of pairs). See the sketch after this list.
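helper_functions.R isn’t reproduced here, so the versions below are illustrative sketches rather than the actual helpers: minimal implementations of the soft misclassification rate and Somers’ D for a vector p of predicted probabilities and a 0/1 response vector y (the names soft_error() and somers_d() are mine).

```r
soft_error <- function(p, y) {
  # Probability assigned to the *wrong* category, averaged over cases:
  # confident correct predictions contribute ~0, confident wrong ones ~1
  mean(ifelse(y == 1, 1 - p, p))
}

somers_d <- function(p, y) {
  # Compare every (1, 0) pair of cases: concordant if the 1 gets the
  # higher predicted probability, discordant if it gets the lower one
  pairs <- expand.grid(i = which(y == 1), j = which(y == 0))
  conc  <- mean(p[pairs$i] > p[pairs$j])
  disc  <- mean(p[pairs$i] < p[pairs$j])
  conc - disc  # ties count toward neither
}
```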

We’ll make a list of our six finalists, plus the “base” model (for comparison).
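A sketch of that list, with names matching the output below (the formulas follow the six combinations above; None is the Age + Temp base model):

```r
models <- list(
  IndexOnly       = Resp ~ Age + Temp + Index,
  IndexInfil      = Resp ~ Age + Temp + Index + Infil,
  IndexSmear      = Resp ~ Age + Temp + Index + Smear,
  IndexBlasts     = Resp ~ Age + Temp + Index + log2(Blasts + 0.05),
  IndexSmearInfil = Resp ~ Age + Temp + Index + Smear + Infil,
  AllFour         = Resp ~ Age + Temp + Index + Smear + Infil + log2(Blasts + 0.05),
  None            = Resp ~ Age + Temp
)
```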

Now we’ll do 10-fold cross-validation, with 10 random splits for each (to reduce the impact of the choice of folds):
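The cross-validation driver isn’t shown either; here is a sketch of its general shape, scoring held-out folds with normalized deviance (details such as fold balancing and averaging may differ from the actual helper code):

```r
norm_dev <- function(p, y) mean(-2 * (y * log(p) + (1 - y) * log(1 - p)))

cv_score <- function(form, data, metric, k = 10, reps = 10) {
  scores <- replicate(reps, {
    fold <- sample(rep(1:k, length.out = nrow(data)))  # random balanced fold labels
    per_fold <- sapply(1:k, function(f) {
      fit <- glm(form, family = binomial, data = data[fold != f, ])
      p   <- predict(fit, newdata = data[fold == f, ], type = "response")
      metric(p, data$Resp[fold == f])
    })
    mean(per_fold)
  })
  mean(scores)  # average over the `reps` random splits
}

sapply(models, cv_score, data = leuk_train, metric = norm_dev)
```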

##       IndexOnly      IndexInfil      IndexSmear     IndexBlasts 
##        1.177679        1.231895        1.172935        1.197182 
## IndexSmearInfil         AllFour            None 
##        1.344194        1.400089        1.394247

All six are relatively close compared to the gap between them and the “base” model, but IndexSmear has the lowest normalized deviance.

##       IndexOnly      IndexInfil      IndexSmear     IndexBlasts 
##          0.2605          0.2660          0.2370          0.2615 
## IndexSmearInfil         AllFour            None 
##          0.2770          0.2895          0.3450

The model with only Index and Smear again has the best result, this time on the hard misclassification rate.

Let’s examine the false positive and false negative rates separately, since if we were using this model to decide whether treatment was worthwhile, deciding to treat and getting no response presumably has a different “cost” than deciding not to treat a patient who would have responded.

##       IndexOnly      IndexInfil      IndexSmear     IndexBlasts 
##       0.2524769       0.2625000       0.2652778       0.2391667 
## IndexSmearInfil         AllFour            None 
##       0.2433333       0.2921296       0.3678704

This time, the model with Index and Blasts gives the best result.

What about false negatives?

##       IndexOnly      IndexInfil      IndexSmear     IndexBlasts 
##       0.2799769       0.2610648       0.2258102       0.2690741 
## IndexSmearInfil         AllFour            None 
##       0.2906944       0.3001852       0.3103704

Index and Smear yields the lowest false negative rate, however.

How about “soft misclassification rate”?

##       IndexOnly      IndexInfil      IndexSmear     IndexBlasts 
##       0.3242208       0.3043737       0.3087972       0.3235381 
## IndexSmearInfil         AllFour            None 
##       0.3183290       0.3318499       0.4484535

Index and Infil gives a slightly lower soft misclassification rate than Index and Smear, but it’s really close.

##       IndexOnly      IndexInfil      IndexSmear     IndexBlasts 
##       0.8358631       0.8699074       0.8256481       0.8372917 
## IndexSmearInfil         AllFour            None 
##       0.8335417       0.8068948       0.7021329

The model with Index and Infil performs the best on this measure (it has the highest concordance).

##       IndexOnly      IndexInfil      IndexSmear     IndexBlasts 
##       0.1641369       0.1300926       0.1743519       0.1627083 
## IndexSmearInfil         AllFour            None 
##       0.1664583       0.1931052       0.2978671

It also has the lowest discordance.

##       IndexOnly      IndexInfil      IndexSmear     IndexBlasts 
##       0.6717262       0.7398148       0.6512963       0.6745833 
## IndexSmearInfil         AllFour            None 
##       0.6670833       0.6137897       0.4042659

Since Somers’ D is calculated by combining concordance and discordance, it makes sense that the model that performed best on both of those separately performs the best here too.

Conclusion

Other than the false positive rate, which was minimized by the model that includes Blasts, two models consistently perform the best: Index with Infil, and Index with Smear.

Which of these we would prefer depends on context: e.g., how invasive and expensive the procedure to collect the Smear measurement is compared to Infil, and how heavily we want to weigh false negatives versus being able to “rank” response probabilities.

Evaluating performance on the test set

Whichever model we decide to go with, we should calculate our various performance measures on the held-out test set.

## Hard Misclassification Soft Misclassification    Normalized Deviance 
##              0.2000000              0.2847369              0.8863534 
##    False Positive Rate    False Negative Rate            Concordance 
##              0.2000000              0.2000000              0.8000000 
##            Discordance              Somer's D 
##              0.2000000              0.6000000

Fit on the full dataset and interpret

## (Intercept)         Age        Temp       Index       Infil 
##      95.568      -0.060      -0.099       0.407       0.034

Prediction equation:

\[ \widehat{LogOdds(Resp = 1)} = 95.568 - 0.060\cdot Age - 0.099 \cdot Temp + 0.407 \cdot Index + 0.034 \cdot Infil\]

Odds and odds ratios:

## (Intercept)         Age        Temp       Index       Infil 
## 3.19528e+41 9.42000e-01 9.05000e-01 1.50200e+00 1.03500e+00

It seems that each year of Age is associated with about a 5.8% decrease in the odds of responding to treatment, controlling for temperature and biomarkers.

Patients with higher temperature readings have lower odds of responding, about 9.5% lower for each tenth of a degree Fahrenheit.

Higher Index scores are associated with a higher probability of responding to treatment: holding age, temperature, and percentage infiltrate fixed, the odds of response increase by about 50% for each 1-point increase in Index.

Similarly, higher Infil scores are associated with a higher probability of responding to treatment, after controlling for the other predictors, with the odds of response increasing by about 3.5% for each 1 point difference.
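These percentage interpretations come straight from exponentiating the coefficients; a quick check, assuming the full-data fit is stored as fit_final (a hypothetical name):

```r
exp(coef(fit_final))              # odds ratios, as printed above
100 * (exp(coef(fit_final)) - 1)  # percent change in odds per unit; e.g. Age: about -5.8
```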

Confidence intervals

Let’s look at the predicted response probabilities for a few example patients.

To see what values are sensible to plug in for Index and Infil, let’s look at a scatterplot and five-number summaries for these markers (Index first, then Infil):

## [1]  1.0  6.0  9.0 13.5 20.0
## [1]  8.0 37.5 61.0 72.5 95.0
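Intervals like those below are the standard construction for glm predictions: build a confidence interval on the log-odds (link) scale, then transform the endpoints back to the probability scale. A sketch, again assuming fit_final, and assuming a 95% level (the level isn’t stated):

```r
predict_with_ci <- function(fit, newdata, level = 0.95) {
  z   <- qnorm(1 - (1 - level) / 2)          # normal critical value
  pr  <- predict(fit, newdata = newdata, type = "link", se.fit = TRUE)
  inv <- function(eta) 1 / (1 + exp(-eta))   # inverse logit
  data.frame(newdata,
             pi.hat = inv(pr$fit),
             lwr    = inv(pr$fit - z * pr$se.fit),
             upr    = inv(pr$fit + z * pr$se.fit))
}

# e.g., the favorable scenario of Patient 1a below:
predict_with_ci(fit_final, data.frame(Age = 20, Temp = 986, Index = 13.5, Infil = 72.5))
```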

Patient 1a: 20 years old with relatively high Index and Infil scores, and a normal body temperature (favorable scenario)

##   Age Temp Index Infil    pi.hat       lwr       upr
## 1  20  986  13.5  72.5 0.9863171 0.8388001 0.9989996

We estimate that a patient with high Index and Infil scores and a normal body temperature has at least an 83% chance of responding to treatment (the lower endpoint of the confidence interval).

Patient 1b: same as Patient 1a but with a fever of 101F

##   Age Temp Index Infil    pi.hat       lwr       upr
## 1  20 1010  13.5  72.5 0.8688945 0.5108812 0.9767723

If a patient like this has a fever, the probability of responding might still be as high as 97 or 98%, but it might be as low as about 51%.

Patient 2a: 50 years old, moderate Index and Infil scores (near the medians), normal body temp

##   Age Temp Index Infil    pi.hat       lwr      upr
## 1  50  986     9    61 0.5615345 0.3450743 0.756861

Patients like this might have anywhere between a 34% and 75% chance of responding; our best guess is that their response chance is around 56%.

Patient 2b: Same as 2a but with a fever of 101F

##   Age Temp Index Infil    pi.hat        lwr     upr
## 1  50 1010     9    61 0.1053432 0.02034128 0.40038

The presence of a fever of 101F worsens the outlook for this patient considerably. The probability of responding may be as low as 2%, and our best guess is that it’s around 10%.

Patient 3: 75 years old, fever of 101F, low Index and Infil (unfavorable scenario)

##   Age Temp Index Infil      pi.hat          lwr       upr
## 1  75 1010     6  37.5 0.003443554 9.905594e-05 0.1075632

Even if our model is somewhat pessimistic, we doubt that the probability of responding to treatment is higher than 10% or so for a patient like this; more likely it’s less than 1%.