```{r setup, include=FALSE}
knitr::opts_chunk$set(
  echo = TRUE,
  message = FALSE,
  results = 'markup')
```

## Loading packages and data

```{r}
library(mosaic)
library(Stat2Data)
```

```{r}
data(Leukemia)
```

## Splitting into Train and Test

We are trying to select a model to predict `Resp` (whether a leukemia patient responds to treatment; `1` = yes, `0` = no) from a set of diagnostic and vitals results. Before we do any modeling, let's set aside a test set.

```{r}
set.seed(29747)  # Don't forget to do this!
N <- nrow(Leukemia)
testN <- round(0.20 * N)
cat("Total N =", N, "; Test N (20%) =", testN)
```

```{r}
HeldOut <- rep(c("no", "yes"), times = c(N - testN, testN))
Leukemia <- Leukemia %>% mutate(HeldOut = sample(HeldOut))
Test  <- filter(Leukemia, HeldOut == "yes")
Train <- filter(Leukemia, HeldOut == "no")
head(Test)
```

## Potential Predictors

We'll restrict our models to the following predictors:

* `Age` (age in years)
* `Temp` (highest temperature, recorded as 10 * degrees F, so 986 corresponds to 98.6 degrees)
* `Smear` (differential percentage of "blasts")
* `Infil` (percent of marrow leukemia infiltrate)
* `Index` (percentage labeling index of bone marrow leukemia cells)
* `Blasts` (absolute number of blasts, in thousands)

**Note**: it's not a good idea to jump into modeling in a domain you don't have any background in, especially when some of the variables are quite technical in nature. I chose this example because it has a little bit of "messiness", but not too much; that said, you wouldn't want to do this for a project unless you know something about leukemia, or at least about these metrics.

## Checking for transformations

We'll check the individual (quantitative) variables' distributions first to see whether any obviously call for a transformation.

### Age

```{r}
bwplot(~Age, data = Train)
histogram(~Age, data = Train)
```

The `Age` variable is fairly symmetric.

### Temperature

```{r}
bwplot(~Temp, data = Train)
histogram(~Temp, data = Train)
```

The `Temp` variable is somewhat right-skewed.

```{r}
bwplot(~sqrt(Temp), data = Train)
histogram(~sqrt(Temp), data = Train)
```

Square root didn't really help. A log isn't really sensible here, because temperature isn't measured from a meaningful zero; that is, ratios of degrees Fahrenheit are meaningless. The square root has a bigger impact near 0, so let's shift the variable (subtracting 980, i.e., 98.0 degrees F) and then transform.

```{r}
bwplot(~sqrt(Temp - 980), data = Train)
histogram(~sqrt(Temp - 980), data = Train)
```

That looks nicer.

### Smear

```{r}
bwplot(~Smear, data = Train)
histogram(~Smear, data = Train)
```

This one isn't too bad.

### Infil

```{r}
bwplot(~Infil, data = Train)
histogram(~Infil, data = Train)
```

This one is a bit left-skewed, but not extreme.

### Index

```{r}
bwplot(~Index, data = Train)
histogram(~Index, data = Train)
```

Looks pretty reasonable.

### Blasts

```{r}
bwplot(~Blasts, data = Train)
histogram(~Blasts, data = Train)
```

That's a lot of skew. It looks structural, though there may be one genuine outlier.

```{r}
library(tidyverse)
## Index plot of Blasts against row number (convert the row names to
## numbers so the cases plot in order along the x-axis)
Train %>%
  rownames_to_column() %>%
  mutate(rowname = as.numeric(rowname)) %>%
  plot(Blasts ~ rowname, data = .)
## Do not run when knitting:
## with(Train, identify(Blasts))
```

Let's check whether that person still looks like an outlier after a log transformation. However, log(0) is undefined. Are there true zeroes in the data?

```{r}
min(~Blasts, data = Train)
```

No true zeroes, so we can take a log.

**Note:** It can sometimes make sense to add a small constant to every observation before taking a log, if you have reason to believe that ratio comparisons are the most natural for that variable.
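
Purely to illustrate that idea (we don't need it here, since `Blasts` has no zeroes; the constant 0.1 below is arbitrary and hypothetical, and is not used anywhere else in this analysis), a shifted log would look like this:

```{r}
## Hypothetical illustration of a shifted log transformation.
## The 0.1 is an arbitrary constant chosen only for demonstration.
bwplot(~log(Blasts + 0.1), data = Train)
histogram(~log(Blasts + 0.1), data = Train)
```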
```{r}
bwplot(~log(Blasts), data = Train)
histogram(~log(Blasts), data = Train)
```

That looks a lot better. That point probably isn't an outlier after all.

## Our pool of variables

We have chosen the following transformations:

* Age --> Age (no transformation)
* Temp --> sqrt(Temp - 980)
* Smear --> Smear (no transformation)
* Infil --> Infil (no transformation)
* Index --> Index (no transformation)
* Blasts --> log(Blasts)

## Checking for multicollinearity

Next, let's check whether any of the variables are redundant with the others (in the sense of being nearly linearly predictable from them).

```{r}
library(car)
## Since vif() expects a model, we'll fit a model with all predictors,
## only for the purposes of checking multicollinearity
fullModel <- glm(
  Resp ~ Age + sqrt(Temp - 980) + Smear + Infil + Index + log(Blasts),
  family = "binomial", data = Train)
vif(fullModel)
```

`Smear` and `Infil` are fairly predictable from the others; `Infil` exceeds our VIF > 5 guideline. We won't exclude these entirely, because they may no longer be so predictable once some other variables are removed, but it's something to keep in mind.

## Looking at single predictor models

The main purpose of this step is to assess whether we might want any polynomial terms.

```{r}
library(arm)
age.model <- glm(Resp ~ Age, family = "binomial", data = Train)
binnedplot(
  fitted(age.model),
  residuals(age.model, type = "pearson"),
  nclass = 15)
```

There might be a bit of curvature there.

```{r}
temp.model <- glm(
  Resp ~ sqrt(Temp - 980),
  family = "binomial", data = Train)
binnedplot(
  fitted(temp.model),
  residuals(temp.model, type = "pearson"),
  nclass = 15)
```

This one looks OKish.

```{r}
smear.model <- glm(
  Resp ~ Smear,
  family = "binomial", data = Train)
binnedplot(
  fitted(smear.model),
  residuals(smear.model, type = "pearson"),
  nclass = 15)
```

That one is a tough call.

```{r}
index.model <- glm(
  Resp ~ Index,
  family = "binomial", data = Train)
binnedplot(
  fitted(index.model),
  residuals(index.model, type = "pearson"),
  nclass = 15)
```

Looks OK.

```{r}
infil.model <- glm(
  Resp ~ Infil,
  family = "binomial", data = Train)
binnedplot(
  fitted(infil.model),
  residuals(infil.model, type = "pearson"),
  nclass = 15)
```

Probably OK?

```{r}
blasts.model <- glm(
  Resp ~ log(Blasts),
  family = "binomial", data = Train)
binnedplot(
  fitted(blasts.model),
  residuals(blasts.model, type = "pearson"),
  nclass = 15)
```

Looks OK. So we may want to consider adding an $Age^2$ term.

## Choosing combinations of predictors to assess

Age and temperature are easy variables to collect, so let's include those, and then consider which other predictors might be important. As a quick first check, let's see if any of them are useful:

```{r}
baseModel <- glm(
  Resp ~ Age + I(Age^2) + sqrt(Temp - 980),
  data = Train, family = "binomial")
fullModel <- glm(
  Resp ~ Age + I(Age^2) + sqrt(Temp - 980) + Smear + Index + Infil + log(Blasts),
  data = Train, family = "binomial")
```

```{r}
anova(baseModel, fullModel, test = "LRT")
```

Pretty strong evidence that at least some of those biomarkers are helpful, after controlling for Age and Temperature.

## Assessing models using AIC

The AIC is analogous to Mallows' $C_p$: it is a "penalized" measure of how well a model fits a dataset. Since there are four markers in question, there are sixteen ($2^4$) possible combinations of these markers we could include in our model. We could fit all of these and examine their AIC, but to cut down on the number we consider, let's first see if there is one of these markers that provides all the information we need on its own.
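
(As an aside, which we won't rely on below: AIC $= 2k - 2\log(\hat{L})$, where $k$ is the number of estimated parameters and $\hat{L}$ is the maximized likelihood, so lower values are better. If we did want to scan all sixteen combinations, it could be done programmatically with something like the sketch below; the objects `include`, `marker.terms`, and `all.aics` are just illustrative names, not part of the main analysis.)

```{r}
## Sketch: fit all 2^4 = 16 combinations of the four markers on top of the
## Age/Temp base model and compare AICs. Illustration only.
include <- expand.grid(
  Smear = c(FALSE, TRUE), Infil = c(FALSE, TRUE),
  Index = c(FALSE, TRUE), Blasts = c(FALSE, TRUE))
marker.terms <- c(Smear = "Smear", Infil = "Infil",
                  Index = "Index", Blasts = "log(Blasts)")
all.aics <- apply(include, 1, function(row) {
  rhs <- paste(c("poly(Age, 2)", "sqrt(Temp - 980)", marker.terms[row]),
               collapse = " + ")
  AIC(glm(as.formula(paste("Resp ~", rhs)),
          family = "binomial", data = Train))
})
names(all.aics) <- apply(include, 1, function(row) {
  if (any(row)) paste(names(marker.terms)[row], collapse = " + ") else "(no markers)"
})
sort(all.aics)
```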
```{r}
IndexModel <- glm(
  Resp ~ poly(Age, 2) + sqrt(Temp - 980) + Index,
  data = Train, family = "binomial")
SmearModel <- glm(
  Resp ~ poly(Age, 2) + sqrt(Temp - 980) + Smear,
  data = Train, family = "binomial")
InfilModel <- glm(
  Resp ~ poly(Age, 2) + sqrt(Temp - 980) + Infil,
  data = Train, family = "binomial")
BlastsModel <- glm(
  Resp ~ poly(Age, 2) + sqrt(Temp - 980) + log(Blasts),
  data = Train, family = "binomial")
```

```{r}
AIC(SmearModel)
AIC(BlastsModel)
AIC(InfilModel)
AIC(IndexModel)
AIC(fullModel)
```

The `Index` marker comes close, but the full model still does a bit better. Because `Index` comes so close to capturing everything in the full model, we likely want it included.

Next, let's fit the various models with `Index` and one other marker, and see if any of these give us an AIC as good as or better than the full model's.

```{r}
IndexModelS <- glm(
  Resp ~ poly(Age, 2) + sqrt(Temp - 980) + Index + Smear,
  data = Train, family = "binomial")
IndexModelI <- glm(
  Resp ~ poly(Age, 2) + sqrt(Temp - 980) + Index + Infil,
  data = Train, family = "binomial")
IndexModelB <- glm(
  Resp ~ poly(Age, 2) + sqrt(Temp - 980) + Index + log(Blasts),
  data = Train, family = "binomial")
```

```{r}
AIC(IndexModelS)
AIC(IndexModelI)
AIC(IndexModelB)
AIC(fullModel)
```

Looks like `Index` and `Infil` together give us a potentially better model (at least, when fitting on this dataset) than including all four markers. But so do `Index` and `Smear`. What about all three?

```{r}
ISIModel <- glm(
  Resp ~ poly(Age, 2) + sqrt(Temp - 980) + Index + Smear + Infil,
  data = Train, family = "binomial")
```

```{r}
AIC(IndexModel)
AIC(IndexModelS)
AIC(IndexModelI)
AIC(IndexModelB)
AIC(ISIModel)
AIC(fullModel)
```

## Let's check the residual plots for these models

```{r}
## Each plot pairs a model's fitted values with its own Pearson residuals
binnedplot(
  fitted(IndexModel),
  residuals(IndexModel, type = "pearson"))
binnedplot(
  fitted(IndexModelS),
  residuals(IndexModelS, type = "pearson"))
binnedplot(
  fitted(IndexModelI),
  residuals(IndexModelI, type = "pearson"))
binnedplot(
  fitted(IndexModelB),
  residuals(IndexModelB, type = "pearson"))
binnedplot(
  fitted(ISIModel),
  residuals(ISIModel, type = "pearson"))
binnedplot(
  fitted(fullModel),
  residuals(fullModel, type = "pearson"))
```

None of them show particular evidence of curvature.

## Checking for possible interactions

One question we might have is whether the relationship between any of the biological markers and the response to treatment is modulated by Age. That is, should we interpret the markers differently depending on the age of the person? This is equivalent to asking whether we want to include an interaction term between Age and any of the other variables.

```{r}
## Use log(Blasts), matching fullModel, so that the two models are nested
AgeInteractionsModel <- glm(
  Resp ~ (sqrt(Temp - 980) + Index + Infil + Smear + log(Blasts)) * Age + I(Age^2),
  family = "binomial", data = Train)
anova(fullModel, AgeInteractionsModel, test = "LRT")
```

Not much evidence for any interactions there.

## Cross-Validation

We've settled on a pool of six models. All six include `Temp`, `Age`, `Age^2` and `Index`. Then we also include

1. Nothing else
2. `Infil`
3. `Smear`
4. `Blasts`
5. `Infil` and `Smear`
6. `Infil`, `Smear` and `Blasts`

Let's do cross-validation with these six models.
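
(To fix ideas before we bring in the course helper functions, here is a minimal sketch of the kind of computation a k-fold cross-validation routine performs for one of these models, using the hard misclassification rate. The `cv.misclassification()` function below is purely illustrative; the `do.k.fold.cv()` helper we actually use may be implemented differently and supports several error metrics.)

```{r}
## Minimal k-fold CV sketch for a logistic model (illustration only)
cv.misclassification <- function(model, data, num.folds = 10) {
  n <- nrow(data)
  ## Randomly assign each row to one of the folds
  folds <- sample(rep(1:num.folds, length.out = n))
  fold.errors <- sapply(1:num.folds, function(k) {
    ## Refit the model without fold k, then predict fold k
    fit <- update(model, data = data[folds != k, ])
    p.hat <- predict(fit, newdata = data[folds == k, ], type = "response")
    actual <- data[folds == k, ]$Resp
    mean((p.hat > 0.5) != actual)
  })
  mean(fold.errors)
}
## Example usage (not run): cv.misclassification(IndexModelI, data = Train)
```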
```{r} # source("http://colindawson.net/stat213/code/helper_functions.R") source("~/213_site/code/helper_functions.R") ``` Some error metrics that make sense for logistic regression: **Normalized deviance**: Residual deviance (on validation set) / sample size We can use `normalized.deviance()` from `helper_functions.R` **"Hard" misclassification rate**: Count prediction as an "error" if fitted probability is on the "wrong side" of 0.5 and compute proportion of validation cases that are "misclassified". Can use `hard.classification.error()` from `helper_functions.R` **False positive rate:**: What proportion of the true 0s are misclassified as 1s? * This ignores the true 1s * Use `false.positive.rate()` **False negative rate:**: What proportion of real 1s are misclassified as 0s * Ignores true 0s * Use `false.negative.rate()` **"Soft" misclassification rate:**: Count each point as a "partial error" according to the predicted probability of the wrong category. * Rewards confident correct predictions and penalizes confident wrong predictions, with "uncertain" predictions (near 0.5) in between. * Compute average error per case in validation set * Use `soft.classification.error()` **Concordance, discordance, and Somers' D:** * These measures consider all pairs of cases with different response values (that is, one 0 and one 1) * Count what proportion of the time the 1 is assigned a higher predicted prob. than the zero (good). This is the **"concordance"**. * Count what proportion of the time the opposite occurs. This is the **"discordance"** (bad) * Note that these may not add to 1, due to possible tied predicted probs. * **Somers' D** := Concordance - Discordance We'll make a list of our six finalists, plus the "base" model (for comparison). ```{r} finalists <- list( IndexOnly = IndexModel, IndexInfil = IndexModelI, IndexSmear = IndexModelS, IndexBlasts = IndexModelB, IndexSmearInfil = ISIModel, AllFour = fullModel, None = baseModel) ``` Now we'll do 10-fold cross-validation, with 10 random splits for each (to reduce the impact of the choice of folds): ```{r} set.seed(8675309) n <- nrow(Train) ## Normalized deviances (lower is better) nds <- sapply( finalists, do.k.fold.cv, data = Train, num.folds = 10, error.metric = normalized.deviance, iterations = 10) nds ``` All six are relatively close compared to the gap between them and the "base" model. ```{r} ## hard misclassification rate (lower is better) hces <- sapply( finalists, do.k.fold.cv, data = Train, num.folds = 10, error.metric = hard.classification.error, iterations = 10) hces ``` The leader here is the model with all four predictors. Let's examine the false-positive and false-negative rates separately, since we presumably don't care equally about false-positive and false-negative predictions. ```{r} fprs <- sapply( finalists, do.k.fold.cv, data = Train, num.folds = 10, error.metric = false.positive.rate, iterations = 10) fprs ``` Lots of false positives for all models, but fewest for the model that includes all four markers. What about false negatives? ```{r} fnrs <- sapply( finalists, do.k.fold.cv, data = Train, num.folds = 10, error.metric = false.negative.rate, iterations = 10) fnrs ``` No false negatives. None of the models have any cases of predicting that someone will fail to respond to treatment who actually does respond. Seems like the models tend to be "optimistic" about responding to treatment overall. How about "soft misclassification rate"? 
How about the "soft" misclassification rate?

```{r}
## Soft misclassification rates (lower is better)
sces <- sapply(
  finalists,
  do.k.fold.cv,
  data = Train,
  num.folds = 10,
  error.metric = soft.classification.error,
  iterations = 10)
sces
```

Interestingly, while the hard misclassification rate pushed us toward the full model, weighting our misclassifications by the confidence in the predicted outcome leads us toward the models with `Index` and `Infil`, possibly with `Smear`.

Next, concordance, discordance, and Somers' D:

```{r}
## Higher is better
concordances <- sapply(
  finalists,
  do.k.fold.cv,
  data = Train,
  num.folds = 10,
  error.metric = concordance,
  iterations = 10)
concordances
```

```{r}
## Lower is better
discordances <- sapply(
  finalists,
  do.k.fold.cv,
  data = Train,
  num.folds = 10,
  error.metric = discordance,
  iterations = 10)
discordances
```

```{r}
## Higher is better
somers.ds <- sapply(
  finalists,
  do.k.fold.cv,
  data = Train,
  num.folds = 10,
  error.metric = somers.d,
  iterations = 10)
somers.ds
```

As with the normalized deviance and the soft misclassification rate, the two models with `Index` and `Infil` (with and without `Smear`) look the best. Which of these we would prefer depends on context (e.g., how invasive and expensive it is to collect the `Smear` measurement), but absent other information, I would tend to opt for the simpler model.

So, the winner is the model with `Age`, `Age^2`, `Temp`, `Index` and `Infil`.

## Evaluating performance on the test set

For this final model, let's calculate our various performance measures on the held-out test set.

```{r}
## Hard misclassification rate
test.hce <- get.held.out.error(
  IndexModelI,
  test.set = Test,
  response.name = "Resp",
  error.metric = hard.classification.error)
test.hce
```

```{r}
## False positive rate
test.fpr <- get.held.out.error(
  IndexModelI,
  test.set = Test,
  response.name = "Resp",
  error.metric = false.positive.rate)
test.fpr
```

```{r}
## False negative rate
test.fnrs <- get.held.out.error(
  IndexModelI,
  test.set = Test,
  response.name = "Resp",
  error.metric = false.negative.rate)
test.fnrs
```

```{r}
## Soft misclassification rate
test.sces <- get.held.out.error(
  IndexModelI,
  test.set = Test,
  response.name = "Resp",
  error.metric = soft.classification.error)
test.sces
```

```{r}
## Normalized deviance
test.nd <- get.held.out.error(
  IndexModelI,
  test.set = Test,
  response.name = "Resp",
  error.metric = normalized.deviance)
test.nd
```

```{r}
## Somers' D
test.somers.d <- get.held.out.error(
  IndexModelI,
  test.set = Test,
  response.name = "Resp",
  error.metric = somers.d)
test.somers.d
```

## Fit on the full dataset and interpret

```{r}
FinalModel <- glm(
  Resp ~ Age + I(Age^2) + sqrt(Temp - 980) + Index + Infil,
  family = "binomial", data = Leukemia)
FinalModel %>% coef() %>% round(3)
```

### Prediction equation

$$
\widehat{\text{log odds}} = 6.7 - 0.4 \cdot \text{Age} + 0.003 \cdot \text{Age}^2 - 0.72 \cdot \sqrt{\text{Temp} - 980} + 0.4 \cdot \text{Index} + 0.034 \cdot \text{Infil}
$$

### Odds and odds ratios

```{r}
FinalModel %>% coef() %>% exp() %>% round(5)
```

It seems that higher `Index` scores are associated with a higher probability of responding to treatment, after controlling for age, temperature, and percentage infiltrate: the odds of response are estimated to increase by about 49% with each 1-point increase in `Index`, holding the other predictors fixed. Similarly, higher `Infil` scores are associated with a higher probability of responding to treatment, after controlling for age, temperature, and `Index`, with the odds of response increasing by about 3.4% for each 1-point difference in `Infil`.
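
(Before interpreting the remaining coefficients, we can attach a rough sense of uncertainty to these odds ratios. The sketch below uses Wald-type intervals via `confint.default()`; profile-likelihood intervals from `confint()` are an alternative and would typically be slightly more accurate.)

```{r}
## Approximate (Wald-type) 95% intervals for the odds ratios
FinalModel %>% confint.default() %>% exp() %>% round(3)
```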
It seems that higher body temperatures (on the shifted square-root scale) are associated with a lower chance of responding to treatment, controlling for age and the biomarkers. For each one-unit increase in the square root of the number of tenths of a degree above 98.0F (relative to what is expected given `Age`, `Index` and `Infil`), the odds of responding to treatment go down by about 52%.

For patients around 40 years of age, and adjusting for expected differences in biomarkers, an additional 5 years of age is associated with

```{r}
odds.factor.45vs40 <- 0.66957^(45 - 40) * 1.00342^(45^2 - 40^2)
odds.factor.45vs40
```

about a 42% decrease in the odds of responding to treatment. Around 50 years of age, an additional 5 years of age is associated with

```{r}
odds.factor.55vs50 <- 0.66957^(55 - 50) * 1.00342^(55^2 - 50^2)
odds.factor.55vs50
```

only about a 19% decrease in the odds of responding.

```{r}
odds.factor.65vs60 <- 0.66957^(65 - 60) * 1.00342^(65^2 - 60^2)
odds.factor.65vs60
```

However, from 60 to 65, the odds of responding to treatment actually go back up (by about 14%), after controlling for temperature and biomarkers. This may be because 65-year-olds are expected to have worse biomarker indicators and/or temperature readings than 60-year-olds, so if a 65-year-old presented with the same indicators as a 60-year-old, that might be a sign of a person with an unusually strong system.

## Confidence intervals

Let's look at the predicted probabilities of responding to treatment for a few example patients.

```{r}
f.hat <- makeFun.logistic(FinalModel)
```

To see what values are sensible to plug in for `Index` and `Infil`, let's look at five-number summaries and a scatterplot for these markers.

```{r}
fivenum(~Index, data = Leukemia)
fivenum(~Infil, data = Leukemia)
```

```{r}
gf_point(Index ~ Infil, data = Leukemia)
```

**Patient 1a: 20 years old with relatively high index and infil scores, and a normal body temperature** (favorable scenario)

```{r}
f.hat(Age = 20, Temp = 986, Index = 13.5, Infil = 72.5, interval = "confidence")
```

We estimate that a young patient with high index and infil scores and a normal body temp has at least a 90% chance of responding to treatment.

**Patient 1b: same as Patient 1a but with a fever of 101F**

```{r}
f.hat(Age = 20, Temp = 1010, Index = 13.5, Infil = 72.5, interval = "confidence")
```

If a patient like this has a fever, the probability of responding might well still be near 100%, but it might be down to something like 67%.

**Patient 2a: 50 years old, moderate index and infil scores (near the median), normal body temp**

```{r}
f.hat(Age = 50, Temp = 986, Index = 9, Infil = 61, interval = "confidence")
```

Patients like this might have anywhere between a 10% and 60% chance of responding.

**Patient 2b: same as 2a but with a fever of 101F**

```{r}
f.hat(Age = 50, Temp = 1010, Index = 9, Infil = 61, interval = "confidence")
```

The presence of a fever of 101F worsens the outlook for this patient considerably: the probability of responding goes down to somewhere between 0.5% and 31%.

**Patient 3: 75 years old, fever of 101F, low index and infil** (unfavorable scenario)

```{r}
f.hat(Age = 75, Temp = 1010, Index = 6, Infil = 37.5, interval = "confidence")
```

Even allowing for the possibility that our model is pessimistic, we doubt that the probability of responding to treatment is higher than 27% or so for a patient like this, and more likely it's less than 1%.
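
(As a footnote: the intervals reported by `f.hat()` come from `makeFun.logistic()` in the helper file. The sketch below shows one way a similar interval can be computed directly with `predict()`: build a Wald-style 95% interval on the log-odds scale and back-transform with the inverse logit. The helper may construct its intervals differently, so this need not match the output above exactly.)

```{r}
## Sketch: approximate 95% confidence interval for Patient 3's probability
## of responding, computed on the log-odds scale and then back-transformed
newpatient <- data.frame(Age = 75, Temp = 1010, Index = 6, Infil = 37.5)
link.pred <- predict(FinalModel, newdata = newpatient,
                     type = "link", se.fit = TRUE)
logit.interval <- link.pred$fit + c(-1.96, 1.96) * link.pred$se.fit
plogis(logit.interval)  # inverse logit: back to the probability scale
```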