```{r setup, include=FALSE}
knitr::opts_chunk$set(
  echo = TRUE,
  message = FALSE,
  results = 'markup')
```

## Loading packages and data

```{r}
library(mosaic)
library(Stat2Data)
```

```{r}
data(Leukemia)
```

## Splitting into Train and Test

We are trying to select a model to predict `Resp` (whether a leukemia patient responds to treatment; `1` = yes, `0` = no) from a set of diagnostic and vitals results. Before we do any modeling, let's set aside a test set.

```{r}
set.seed(29747)  # Don't forget to do this!
N <- nrow(Leukemia)
testN <- round(0.20 * N)
cat("Total N =", N, "; Test N (20%) =", testN)
```

```{r}
HeldOut <- rep(c("no", "yes"), times = c(N - testN, testN))
Leukemia <- Leukemia %>% mutate(HeldOut = sample(HeldOut))
Test  <- filter(Leukemia, HeldOut == "yes")
Train <- filter(Leukemia, HeldOut == "no")
head(Test)
```

## Potential Predictors

We'll restrict our models to the following predictors:

* `Age` (age in years)
* `Temp` (highest temperature, recorded as 10 * degrees F, so 986 corresponds to 98.6 degrees)
* `Smear` (differential percentage of "blasts")
* `Infil` (percent of marrow leukemia infiltrate)
* `Index` (percentage labeling index of bone marrow leukemia cells)
* `Blasts` (absolute number of blasts, in thousands)

**Note**: it's not a good idea to jump into modeling in a domain you don't have any background in, especially when some of the variables are quite technical in nature. I chose this example because it has a little bit of "messiness", but not too much; that said, you wouldn't want to do this for a project unless you know something about leukemia, or at least about these metrics.

## Checking for transformations

We'll check the individual (quantitative) variables' distributions first to see whether any obviously call for a transformation.

### Age

```{r}
bwplot(~Age, data = Train)
histogram(~Age, data = Train)
```

The `Age` variable is fairly symmetric.

### Temperature

```{r}
bwplot(~Temp, data = Train)
histogram(~Temp, data = Train)
```

The `Temp` variable is somewhat right-skewed.

```{r}
bwplot(~sqrt(Temp), data = Train)
histogram(~sqrt(Temp), data = Train)
```

Square root didn't really help. A log isn't really sensible here, because temperature isn't measured from a meaningful zero; that is, ratios of degrees Fahrenheit are meaningless. The square root has a bigger impact near 0, so let's shift the variable (subtracting 980, i.e., 98.0 degrees F) and then transform.

```{r}
bwplot(~sqrt(Temp - 980), data = Train)
histogram(~sqrt(Temp - 980), data = Train)
```

That looks nicer.

### Smear

```{r}
bwplot(~Smear, data = Train)
histogram(~Smear, data = Train)
```

This one isn't too bad.

### Infil

```{r}
bwplot(~Infil, data = Train)
histogram(~Infil, data = Train)
```

This one is a bit left-skewed, but not extreme.

### Index

```{r}
bwplot(~Index, data = Train)
histogram(~Index, data = Train)
```

Looks pretty reasonable.

### Blasts

```{r}
bwplot(~Blasts, data = Train)
histogram(~Blasts, data = Train)
```

That's a lot of skew. It looks structural, though there may be one genuine outlier.

```{r}
library(tidyverse)
## Index plot of Blasts against row number (convert the row names to
## numbers so the cases plot in order along the x-axis)
Train %>%
  rownames_to_column() %>%
  mutate(rowname = as.numeric(rowname)) %>%
  plot(Blasts ~ rowname, data = .)
## Do not run when knitting:
## with(Train, identify(Blasts))
```

Let's check whether that person still looks like an outlier after a log transformation. However, log(0) is undefined. Are there true zeroes in the data?

```{r}
min(~Blasts, data = Train)
```

No true zeroes, so we can take a log.

**Note:** It can sometimes make sense to add a small constant to every observation before taking a log, if you have reason to believe that ratio comparisons are the most natural for that variable.
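
Purely to illustrate that idea (we don't need it here, since `Blasts` has no zeroes; the constant 0.1 below is arbitrary and hypothetical, and is not used anywhere else in this analysis), a shifted log would look like this:

```{r}
## Hypothetical illustration of a shifted log transformation.
## The 0.1 is an arbitrary constant chosen only for demonstration.
bwplot(~log(Blasts + 0.1), data = Train)
histogram(~log(Blasts + 0.1), data = Train)
```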
```{r}
bwplot(~log(Blasts), data = Train)
histogram(~log(Blasts), data = Train)
```

That looks a lot better. That point probably isn't an outlier after all.

## Our pool of variables

We have chosen the following transformations:

* Age --> Age (no transformation)
* Temp --> sqrt(Temp - 980)
* Smear --> Smear (no transformation)
* Infil --> Infil (no transformation)
* Index --> Index (no transformation)
* Blasts --> log(Blasts)

## Checking for multicollinearity

Next, let's check whether any of the variables are redundant with the others (in the sense of being nearly linearly predictable from them).

```{r}
library(car)
## Since vif() expects a model, we'll fit a model with all predictors,
## only for the purposes of checking multicollinearity
fullModel <- glm(
  Resp ~ Age + sqrt(Temp - 980) + Smear + Infil + Index + log(Blasts),
  family = "binomial", data = Train)
vif(fullModel)
```

`Smear` and `Infil` are fairly predictable from the others; `Infil` exceeds our VIF > 5 guideline. We won't exclude these entirely, because they may no longer be so predictable once some other variables are removed, but it's something to keep in mind.

## Looking at single predictor models

The main purpose of this step is to assess whether we might want any polynomial terms.

```{r}
library(arm)
age.model <- glm(Resp ~ Age, family = "binomial", data = Train)
binnedplot(
  fitted(age.model),
  residuals(age.model, type = "pearson"),
  nclass = 15)
```

There might be a bit of curvature there.

```{r}
temp.model <- glm(
  Resp ~ sqrt(Temp - 980),
  family = "binomial", data = Train)
binnedplot(
  fitted(temp.model),
  residuals(temp.model, type = "pearson"),
  nclass = 15)
```

This one looks OKish.

```{r}
smear.model <- glm(
  Resp ~ Smear,
  family = "binomial", data = Train)
binnedplot(
  fitted(smear.model),
  residuals(smear.model, type = "pearson"),
  nclass = 15)
```

That one is a tough call.

```{r}
index.model <- glm(
  Resp ~ Index,
  family = "binomial", data = Train)
binnedplot(
  fitted(index.model),
  residuals(index.model, type = "pearson"),
  nclass = 15)
```

Looks OK.

```{r}
infil.model <- glm(
  Resp ~ Infil,
  family = "binomial", data = Train)
binnedplot(
  fitted(infil.model),
  residuals(infil.model, type = "pearson"),
  nclass = 15)
```

Probably OK?

```{r}
blasts.model <- glm(
  Resp ~ log(Blasts),
  family = "binomial", data = Train)
binnedplot(
  fitted(blasts.model),
  residuals(blasts.model, type = "pearson"),
  nclass = 15)
```

Looks OK. So we may want to consider adding an $Age^2$ term.

## Choosing combinations of predictors to assess

Age and temperature are easy variables to collect, so let's include those, and then consider which other predictors might be important. As a quick first check, let's see if any of them are useful:

```{r}
baseModel <- glm(
  Resp ~ Age + I(Age^2) + sqrt(Temp - 980),
  data = Train, family = "binomial")
fullModel <- glm(
  Resp ~ Age + I(Age^2) + sqrt(Temp - 980) + Smear + Index + Infil + log(Blasts),
  data = Train, family = "binomial")
```

```{r}
anova(baseModel, fullModel, test = "LRT")
```

Pretty strong evidence that at least some of those biomarkers are helpful, after controlling for Age and Temperature.

## Assessing models using AIC

The AIC is analogous to Mallows' $C_p$: it is a "penalized" measure of how well a model fits a dataset. Since there are four markers in question, there are sixteen ($2^4$) possible combinations of these markers we could include in our model. We could fit all of these and examine their AIC, but to cut down on the number we consider, let's first see if there is one of these markers that provides all the information we need on its own.
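
(As an aside, which we won't rely on below: AIC $= 2k - 2\log(\hat{L})$, where $k$ is the number of estimated parameters and $\hat{L}$ is the maximized likelihood, so lower values are better. If we did want to scan all sixteen combinations, it could be done programmatically with something like the sketch below; the objects `include`, `marker.terms`, and `all.aics` are just illustrative names, not part of the main analysis.)

```{r}
## Sketch: fit all 2^4 = 16 combinations of the four markers on top of the
## Age/Temp base model and compare AICs. Illustration only.
include <- expand.grid(
  Smear = c(FALSE, TRUE), Infil = c(FALSE, TRUE),
  Index = c(FALSE, TRUE), Blasts = c(FALSE, TRUE))
marker.terms <- c(Smear = "Smear", Infil = "Infil",
                  Index = "Index", Blasts = "log(Blasts)")
all.aics <- apply(include, 1, function(row) {
  rhs <- paste(c("poly(Age, 2)", "sqrt(Temp - 980)", marker.terms[row]),
               collapse = " + ")
  AIC(glm(as.formula(paste("Resp ~", rhs)),
          family = "binomial", data = Train))
})
names(all.aics) <- apply(include, 1, function(row) {
  if (any(row)) paste(names(marker.terms)[row], collapse = " + ") else "(no markers)"
})
sort(all.aics)
```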
```{r}
IndexModel <- glm(
  Resp ~ poly(Age, 2) + sqrt(Temp - 980) + Index,
  data = Train, family = "binomial")
SmearModel <- glm(
  Resp ~ poly(Age, 2) + sqrt(Temp - 980) + Smear,
  data = Train, family = "binomial")
InfilModel <- glm(
  Resp ~ poly(Age, 2) + sqrt(Temp - 980) + Infil,
  data = Train, family = "binomial")
BlastsModel <- glm(
  Resp ~ poly(Age, 2) + sqrt(Temp - 980) + log(Blasts),
  data = Train, family = "binomial")
```

```{r}
AIC(SmearModel)
AIC(BlastsModel)
AIC(InfilModel)
AIC(IndexModel)
AIC(fullModel)
```

The `Index` marker comes close, but the full model still does a bit better. Because `Index` comes so close to capturing everything in the full model, we likely want it included.

Next, let's fit the various models with `Index` and one other marker, and see if any of these give us an AIC as good as or better than the full model's.

```{r}
IndexModelS <- glm(
  Resp ~ poly(Age, 2) + sqrt(Temp - 980) + Index + Smear,
  data = Train, family = "binomial")
IndexModelI <- glm(
  Resp ~ poly(Age, 2) + sqrt(Temp - 980) + Index + Infil,
  data = Train, family = "binomial")
IndexModelB <- glm(
  Resp ~ poly(Age, 2) + sqrt(Temp - 980) + Index + log(Blasts),
  data = Train, family = "binomial")
```

```{r}
AIC(IndexModelS)
AIC(IndexModelI)
AIC(IndexModelB)
AIC(fullModel)
```

Looks like `Index` and `Infil` together give us a potentially better model (at least, when fitting on this dataset) than including all four markers. But so do `Index` and `Smear`. What about all three?

```{r}
ISIModel <- glm(
  Resp ~ poly(Age, 2) + sqrt(Temp - 980) + Index + Smear + Infil,
  data = Train, family = "binomial")
```

```{r}
AIC(IndexModel)
AIC(IndexModelS)
AIC(IndexModelI)
AIC(IndexModelB)
AIC(ISIModel)
AIC(fullModel)
```

## Let's check the residual plots for these models

```{r}
## Each plot pairs a model's fitted values with its own Pearson residuals
binnedplot(
  fitted(IndexModel),
  residuals(IndexModel, type = "pearson"))
binnedplot(
  fitted(IndexModelS),
  residuals(IndexModelS, type = "pearson"))
binnedplot(
  fitted(IndexModelI),
  residuals(IndexModelI, type = "pearson"))
binnedplot(
  fitted(IndexModelB),
  residuals(IndexModelB, type = "pearson"))
binnedplot(
  fitted(ISIModel),
  residuals(ISIModel, type = "pearson"))
binnedplot(
  fitted(fullModel),
  residuals(fullModel, type = "pearson"))
```

None of them show particular evidence of curvature.

## Checking for possible interactions

One question we might have is whether the relationship between any of the biological markers and the response to treatment is modulated by Age. That is, should we interpret the markers differently depending on the age of the person? This is equivalent to asking whether we want to include an interaction term between Age and any of the other variables.

```{r}
## Use log(Blasts), matching fullModel, so that the two models are nested
AgeInteractionsModel <- glm(
  Resp ~ (sqrt(Temp - 980) + Index + Infil + Smear + log(Blasts)) * Age + I(Age^2),
  family = "binomial", data = Train)
anova(fullModel, AgeInteractionsModel, test = "LRT")
```

Not much evidence for any interactions there.

## Cross-Validation

We've settled on a pool of six models. All six include `Temp`, `Age`, `Age^2` and `Index`. Then we also include

1. Nothing else
2. `Infil`
3. `Smear`
4. `Blasts`
5. `Infil` and `Smear`
6. `Infil`, `Smear` and `Blasts`

Let's do cross-validation with these six models.
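
(To fix ideas before we bring in the course helper functions, here is a minimal sketch of the kind of computation a k-fold cross-validation routine performs for one of these models, using the hard misclassification rate. The `cv.misclassification()` function below is purely illustrative; the `do.k.fold.cv()` helper we actually use may be implemented differently and supports several error metrics.)

```{r}
## Minimal k-fold CV sketch for a logistic model (illustration only)
cv.misclassification <- function(model, data, num.folds = 10) {
  n <- nrow(data)
  ## Randomly assign each row to one of the folds
  folds <- sample(rep(1:num.folds, length.out = n))
  fold.errors <- sapply(1:num.folds, function(k) {
    ## Refit the model without fold k, then predict fold k
    fit <- update(model, data = data[folds != k, ])
    p.hat <- predict(fit, newdata = data[folds == k, ], type = "response")
    actual <- data[folds == k, ]$Resp
    mean((p.hat > 0.5) != actual)
  })
  mean(fold.errors)
}
## Example usage (not run): cv.misclassification(IndexModelI, data = Train)
```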
```{r} # source("http://colindawson.net/stat213/code/helper_functions.R") source("~/213_site/code/helper_functions.R") ``` Some error metrics that make sense for logistic regression: **Normalized deviance**: Residual deviance (on validation set) / sample size We can use `normalized.deviance()` from `helper_functions.R` **"Hard" misclassification rate**: Count prediction as an "error" if fitted probability is on the "wrong side" of 0.5 and compute proportion of validation cases that are "misclassified". Can use `hard.classification.error()` from `helper_functions.R` **False positive rate:**: What proportion of the true 0s are misclassified as 1s? * This ignores the true 1s * Use `false.positive.rate()` **False negative rate:**: What proportion of real 1s are misclassified as 0s * Ignores true 0s * Use `false.negative.rate()` **"Soft" misclassification rate:**: Count each point as a "partial error" according to the predicted probability of the wrong category. * Rewards confident correct predictions and penalizes confident wrong predictions, with "uncertain" predictions (near 0.5) in between. * Compute average error per case in validation set * Use `soft.classification.error()` **Concordance, discordance, and Somers' D:** * These measures consider all pairs of cases with different response values (that is, one 0 and one 1) * Count what proportion of the time the 1 is assigned a higher predicted prob. than the zero (good). This is the **"concordance"**. * Count what proportion of the time the opposite occurs. This is the **"discordance"** (bad) * Note that these may not add to 1, due to possible tied predicted probs. * **Somers' D** := Concordance - Discordance We'll make a list of our six finalists, plus the "base" model (for comparison). ```{r} finalists <- list( IndexOnly = IndexModel, IndexInfil = IndexModelI, IndexSmear = IndexModelS, IndexBlasts = IndexModelB, IndexSmearInfil = ISIModel, AllFour = fullModel, None = baseModel) ``` Now we'll do 10-fold cross-validation, with 10 random splits for each (to reduce the impact of the choice of folds): ```{r} set.seed(8675309) n <- nrow(Train) ## Normalized deviances (lower is better) nds <- sapply( finalists, do.k.fold.cv, data = Train, num.folds = 10, error.metric = normalized.deviance, iterations = 10) nds ``` All six are relatively close compared to the gap between them and the "base" model. ```{r} ## hard misclassification rate (lower is better) hces <- sapply( finalists, do.k.fold.cv, data = Train, num.folds = 10, error.metric = hard.classification.error, iterations = 10) hces ``` The leader here is the model with all four predictors. Let's examine the false-positive and false-negative rates separately, since we presumably don't care equally about false-positive and false-negative predictions. ```{r} fprs <- sapply( finalists, do.k.fold.cv, data = Train, num.folds = 10, error.metric = false.positive.rate, iterations = 10) fprs ``` Lots of false positives for all models, but fewest for the model that includes all four markers. What about false negatives? ```{r} fnrs <- sapply( finalists, do.k.fold.cv, data = Train, num.folds = 10, error.metric = false.negative.rate, iterations = 10) fnrs ``` No false negatives. None of the models have any cases of predicting that someone will fail to respond to treatment who actually does respond. Seems like the models tend to be "optimistic" about responding to treatment overall. How about "soft misclassification rate"? 
How about the "soft" misclassification rate?

```{r}
## Soft misclassification rates (lower is better)
sces <- sapply(
  finalists,
  do.k.fold.cv,
  data = Train,
  num.folds = 10,
  error.metric = soft.classification.error,
  iterations = 10)
sces
```

Interestingly, while the hard misclassification rate pushed us toward the full model, weighting our misclassifications by the confidence in the predicted outcome leads us toward the models with `Index` and `Infil`, possibly with `Smear`.

Next, concordance, discordance, and Somers' D:

```{r}
## Higher is better
concordances <- sapply(
  finalists,
  do.k.fold.cv,
  data = Train,
  num.folds = 10,
  error.metric = concordance,
  iterations = 10)
concordances
```

```{r}
## Lower is better
discordances <- sapply(
  finalists,
  do.k.fold.cv,
  data = Train,
  num.folds = 10,
  error.metric = discordance,
  iterations = 10)
discordances
```

```{r}
## Higher is better
somers.ds <- sapply(
  finalists,
  do.k.fold.cv,
  data = Train,
  num.folds = 10,
  error.metric = somers.d,
  iterations = 10)
somers.ds
```

As with the normalized deviance and the soft misclassification rate, the two models with `Index` and `Infil` (with and without `Smear`) look the best. Which of these we would prefer depends on context (e.g., how invasive and expensive it is to collect the `Smear` measurement), but absent other information, I would tend to opt for the simpler model.

So, the winner is the model with `Age`, `Age^2`, `Temp`, `Index` and `Infil`.

## Evaluating performance on the test set

For this final model, let's calculate our various performance measures on the held-out test set.

```{r}
## Hard misclassification rate
test.hce <- get.held.out.error(
  IndexModelI,
  test.set = Test,
  response.name = "Resp",
  error.metric = hard.classification.error)
test.hce
```

```{r}
## False positive rate
test.fpr <- get.held.out.error(
  IndexModelI,
  test.set = Test,
  response.name = "Resp",
  error.metric = false.positive.rate)
test.fpr
```

```{r}
## False negative rate
test.fnrs <- get.held.out.error(
  IndexModelI,
  test.set = Test,
  response.name = "Resp",
  error.metric = false.negative.rate)
test.fnrs
```

```{r}
## Soft misclassification rate
test.sces <- get.held.out.error(
  IndexModelI,
  test.set = Test,
  response.name = "Resp",
  error.metric = soft.classification.error)
test.sces
```

```{r}
## Normalized deviance
test.nd <- get.held.out.error(
  IndexModelI,
  test.set = Test,
  response.name = "Resp",
  error.metric = normalized.deviance)
test.nd
```

```{r}
## Somers' D
test.somers.d <- get.held.out.error(
  IndexModelI,
  test.set = Test,
  response.name = "Resp",
  error.metric = somers.d)
test.somers.d
```

## Fit on the full dataset and interpret

```{r}
FinalModel <- glm(
  Resp ~ Age + I(Age^2) + sqrt(Temp - 980) + Index + Infil,
  family = "binomial", data = Leukemia)
FinalModel %>% coef() %>% round(3)
```

### Prediction equation

$$
\widehat{\text{log odds}} = 6.7 - 0.4 \cdot \text{Age} + 0.003 \cdot \text{Age}^2 - 0.72 \cdot \sqrt{\text{Temp} - 980} + 0.4 \cdot \text{Index} + 0.034 \cdot \text{Infil}
$$

### Odds and odds ratios

```{r}
FinalModel %>% coef() %>% exp() %>% round(5)
```

It seems that higher `Index` scores are associated with a higher probability of responding to treatment, after controlling for age, temperature, and percentage infiltrate: the odds of response are estimated to increase by about 49% with each 1-point increase in `Index`, holding the other predictors fixed. Similarly, higher `Infil` scores are associated with a higher probability of responding to treatment, after controlling for age, temperature, and `Index`, with the odds of response increasing by about 3.4% for each 1-point difference in `Infil`.
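
(Before interpreting the remaining coefficients, we can attach a rough sense of uncertainty to these odds ratios. The sketch below uses Wald-type intervals via `confint.default()`; profile-likelihood intervals from `confint()` are an alternative and would typically be slightly more accurate.)

```{r}
## Approximate (Wald-type) 95% intervals for the odds ratios
FinalModel %>% confint.default() %>% exp() %>% round(3)
```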
It seems that higher body temperatures (on the shifted square-root scale) are associated with a lower chance of responding to treatment, controlling for age and the biomarkers. For each one-unit increase in the square root of the number of tenths of a degree above 98.0F (relative to what is expected given `Age`, `Index` and `Infil`), the odds of responding to treatment go down by about 52%.

For patients around 40 years of age, and adjusting for expected differences in biomarkers, an additional 5 years of age is associated with

```{r}
odds.factor.45vs40 <- 0.66957^(45 - 40) * 1.00342^(45^2 - 40^2)
odds.factor.45vs40
```

about a 42% decrease in the odds of responding to treatment. Around 50 years of age, an additional 5 years of age is associated with

```{r}
odds.factor.55vs50 <- 0.66957^(55 - 50) * 1.00342^(55^2 - 50^2)
odds.factor.55vs50
```

only about a 19% decrease in the odds of responding.

```{r}
odds.factor.65vs60 <- 0.66957^(65 - 60) * 1.00342^(65^2 - 60^2)
odds.factor.65vs60
```

However, from 60 to 65, the odds of responding to treatment actually go back up (by about 14%), after controlling for temperature and biomarkers. This may be because 65-year-olds are expected to have worse biomarker indicators and/or temperature readings than 60-year-olds, so if a 65-year-old presented with the same indicators as a 60-year-old, that might be a sign of a person with an unusually strong system.

## Confidence intervals

Let's look at the predicted probabilities of responding to treatment for a few example patients.

```{r}
f.hat <- makeFun.logistic(FinalModel)
```

To see what values are sensible to plug in for `Index` and `Infil`, let's look at five-number summaries and a scatterplot for these markers.

```{r}
fivenum(~Index, data = Leukemia)
fivenum(~Infil, data = Leukemia)
```

```{r}
gf_point(Index ~ Infil, data = Leukemia)
```

**Patient 1a: 20 years old with relatively high index and infil scores, and a normal body temperature** (favorable scenario)

```{r}
f.hat(Age = 20, Temp = 986, Index = 13.5, Infil = 72.5, interval = "confidence")
```

We estimate that a young patient with high index and infil scores and a normal body temp has at least a 90% chance of responding to treatment.

**Patient 1b: same as Patient 1a but with a fever of 101F**

```{r}
f.hat(Age = 20, Temp = 1010, Index = 13.5, Infil = 72.5, interval = "confidence")
```

If a patient like this has a fever, the probability of responding might well still be near 100%, but it might be down to something like 67%.

**Patient 2a: 50 years old, moderate index and infil scores (near the median), normal body temp**

```{r}
f.hat(Age = 50, Temp = 986, Index = 9, Infil = 61, interval = "confidence")
```

Patients like this might have anywhere between a 10% and 60% chance of responding.

**Patient 2b: same as 2a but with a fever of 101F**

```{r}
f.hat(Age = 50, Temp = 1010, Index = 9, Infil = 61, interval = "confidence")
```

The presence of a fever of 101F worsens the outlook for this patient considerably: the probability of responding goes down to somewhere between 0.5% and 31%.

**Patient 3: 75 years old, fever of 101F, low index and infil** (unfavorable scenario)

```{r}
f.hat(Age = 75, Temp = 1010, Index = 6, Infil = 37.5, interval = "confidence")
```

Even allowing for the possibility that our model is pessimistic, we doubt that the probability of responding to treatment is higher than 27% or so for a patient like this, and more likely it's less than 1%.
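
(As a footnote: the intervals reported by `f.hat()` come from `makeFun.logistic()` in the helper file. The sketch below shows one way a similar interval can be computed directly with `predict()`: build a Wald-style 95% interval on the log-odds scale and back-transform with the inverse logit. The helper may construct its intervals differently, so this need not match the output above exactly.)

```{r}
## Sketch: approximate 95% confidence interval for Patient 3's probability
## of responding, computed on the log-odds scale and then back-transformed
newpatient <- data.frame(Age = 75, Temp = 1010, Index = 6, Infil = 37.5)
link.pred <- predict(FinalModel, newdata = newpatient,
                     type = "link", se.fit = TRUE)
logit.interval <- link.pred$fit + c(-1.96, 1.96) * link.pred$se.fit
plogis(logit.interval)  # inverse logit: back to the probability scale
```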