How can we decide what the “best” model is from a set of candidates? We have seen some methods already (for example, comparing nested models).
But what if we have more than a couple of models that reasonably satisfy the regression conditions, or we have candidate models that aren’t nested?
Sometimes it can be useful to generate simulated data, where we know exactly what the “correct” model/population looks like, so that we can see how our various statistical metrics and procedures perform.
In the following simulation, we’ll create a dataset sampled from a known population process, which is characterized by the following model:
\[ Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \beta_3 X_{i3} + \varepsilon_i \]
where the residuals (the \(\varepsilon_i\)s) are independent of each other and are Normally distributed with a mean of zero and a standard deviation of 1. So there are exactly three relevant predictors: \(X_1\), \(X_2\), and \(X_3\).
That is, we’re creating a population in which there is a “true” model that satisfies our standard regression conditions.
The following code will generate a dataset of a specified size (here, 15 data points) which includes the three relevant predictors as well as seven more irrelevant predictors.
The set.seed() command “anchors” the random number generator so that if you run the chunk multiple times with the same “seed” you will get the same dataset, but if you change the seed you’ll get a different dataset. Change the number inside set.seed() so you get a different dataset from other groups.
set.seed(4321)
## N = number of data points to generate
## K = number of potential predictors to generate
N <- 15; K <- 10
## Generate random predictors (from a Normal distribution)
Predictors <- do(N) * rnorm(K, mean = 0, sd = 1)
## Name them X1, ..., XK (here K = 10)
names(Predictors) <- paste0("X", 1:K)
Here’s what the dataset looks like so far:
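The chunk that produced this display isn’t echoed in these notes; since the data frame has only 15 rows, simply printing it (a minimal sketch using the Predictors object created above) shows the whole thing:
## print the full (small) data frame of generated predictors
Predictors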
## X1 X2 X3 X4 X5
## 1 -0.06723632 0.344367097 -1.260985237 1.1394641 -1.22178192
## 2 0.48352186 -0.003025539 -0.008930402 0.5933576 -0.09920208
## 3 0.14027895 0.662125963 1.131039665 -0.4751120 0.85241411
## 4 -0.14679568 1.183290044 0.419156068 -0.7710384 0.71452770
## 5 0.23099272 0.209720171 1.474832991 2.0802482 0.14841417
## 6 -0.12422362 -0.019105983 -0.529369588 0.8234882 -0.09490660
## 7 0.72906491 -1.308636074 -0.571149276 0.7641005 0.92692699
## 8 -2.45001626 0.564470520 0.031525945 1.8675152 -2.43932045
## 9 -0.58778361 -1.039416308 0.155512893 -0.1596948 1.85649268
## 10 0.80465245 0.543004777 -0.836353314 -0.2528785 1.40328134
## 11 0.71583781 -0.965897366 -0.638773126 -1.2241052 -0.50995957
## 12 -1.62122610 -0.938844874 1.141673229 0.5780161 0.61579979
## 13 0.10937685 0.736209672 2.151278838 -2.1621296 0.03722538
## 14 0.92418032 0.133799711 -0.765966795 1.5728608 -0.30395797
## 15 -1.88670515 0.477533977 0.389948104 -0.1341345 -0.47311218
## X6 X7 X8 X9 X10
## 1 1.57331589 0.07347787 -1.17511509 -1.58826190 -0.74738073
## 2 -0.23803425 0.04778266 0.29651274 -0.83380992 -1.37397000
## 3 -0.75151885 -0.17566620 0.75196443 0.23721090 1.26749634
## 4 0.10010522 1.03740917 0.84239825 0.09036045 0.01972571
## 5 1.71852758 1.03921216 -1.18160387 1.02843119 0.04647717
## 6 -0.41965428 -0.77916316 0.04815703 0.21454715 0.18274507
## 7 -0.55190791 0.20261885 0.43240423 0.18691659 -0.92736819
## 8 0.58528027 1.24595327 -1.07168921 -0.34279289 0.04284334
## 9 1.30870356 -1.62019637 1.68940879 -0.04145376 1.05665648
## 10 0.24342600 -1.08617858 1.41389620 1.55033432 -0.72895214
## 11 1.14655152 -1.33544032 -0.95273673 0.17503179 0.01635606
## 12 0.95300884 -0.87182069 -0.55269482 -0.31294268 -0.31854207
## 13 -0.01748769 -1.15753651 0.60970025 1.16305930 -0.22062731
## 14 0.71335595 -1.14783543 0.46421474 1.70179171 0.14326326
## 15 -0.37919602 -1.16207650 1.22391159 0.01205071 -0.85406964
Next we’ll generate the true coefficients for the population model.
set.seed(8675309)
## set the true beta0 to 0
beta0 <- 0
## generate the betas for the "useful" predictors randomly between -1 and 1
beta1 <- runif(1, min = -1, max = 1)
beta2 <- runif(1, min = -1, max = 1)
beta3 <- runif(1, min = -1, max = 1)
Here are the values for the coefficients (other than the intercept, which we set to zero) that we got:
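The chunk that printed these values isn’t echoed here; a minimal sketch that produces similar lines (rounding to two decimal places, using the beta objects defined above) would be:
## report the randomly generated coefficients, rounded for display
print(paste0("beta1 = ", round(beta1, 2)))
print(paste0("beta2 = ", round(beta2, 2)))
print(paste0("beta3 = ", round(beta3, 2)))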
## [1] "beta1 = -0.68"
## [1] "beta2 = -0.04"
## [1] "beta3 = 0.53"
Now that we have the true population model, let’s generate values of the response variable for our fake dataset.
FakeData <- Predictors %>%
mutate(
Y_hat_true = beta1 * X1 + beta2 * X2 + beta3 * X3,
epsilon_true = rnorm(N, mean = 0, sd = 1),
Y = Y_hat_true + epsilon_true)
We wouldn’t normally have the “true” \(\hat{Y}\) values or the true \(\varepsilon\) values of course, but since this is a simulation we can keep track of those for the purposes of evaluating the performance of our procedures.
Here’s what our full dataset looks like.
## X1 X2 X3 X4 X5
## 1 -0.06723632 0.344367097 -1.260985237 1.1394641 -1.22178192
## 2 0.48352186 -0.003025539 -0.008930402 0.5933576 -0.09920208
## 3 0.14027895 0.662125963 1.131039665 -0.4751120 0.85241411
## 4 -0.14679568 1.183290044 0.419156068 -0.7710384 0.71452770
## 5 0.23099272 0.209720171 1.474832991 2.0802482 0.14841417
## 6 -0.12422362 -0.019105983 -0.529369588 0.8234882 -0.09490660
## 7 0.72906491 -1.308636074 -0.571149276 0.7641005 0.92692699
## 8 -2.45001626 0.564470520 0.031525945 1.8675152 -2.43932045
## 9 -0.58778361 -1.039416308 0.155512893 -0.1596948 1.85649268
## 10 0.80465245 0.543004777 -0.836353314 -0.2528785 1.40328134
## 11 0.71583781 -0.965897366 -0.638773126 -1.2241052 -0.50995957
## 12 -1.62122610 -0.938844874 1.141673229 0.5780161 0.61579979
## 13 0.10937685 0.736209672 2.151278838 -2.1621296 0.03722538
## 14 0.92418032 0.133799711 -0.765966795 1.5728608 -0.30395797
## 15 -1.88670515 0.477533977 0.389948104 -0.1341345 -0.47311218
## X6 X7 X8 X9 X10 Y_hat_true
## 1 1.57331589 0.07347787 -1.17511509 -1.58826190 -0.74738073 -0.6370468
## 2 -0.23803425 0.04778266 0.29651274 -0.83380992 -1.37397000 -0.3338917
## 3 -0.75151885 -0.17566620 0.75196443 0.23721090 1.26749634 0.4745769
## 4 0.10010522 1.03740917 0.84239825 0.09036045 0.01972571 0.2703376
## 5 1.71852758 1.03921216 -1.18160387 1.02843119 0.04647717 0.6146054
## 6 -0.41965428 -0.77916316 0.04815703 0.21454715 0.18274507 -0.1949189
## 7 -0.55190791 0.20261885 0.43240423 0.18691659 -0.92736819 -0.7419092
## 8 0.58528027 1.24595327 -1.07168921 -0.34279289 0.04284334 1.6606132
## 9 1.30870356 -1.62019637 1.68940879 -0.04145376 1.05665648 0.5280019
## 10 0.24342600 -1.08617858 1.41389620 1.55033432 -0.72895214 -1.0146129
## 11 1.14655152 -1.33544032 -0.95273673 0.17503179 0.01635606 -0.7836659
## 12 0.95300884 -0.87182069 -0.55269482 -0.31294268 -0.31854207 1.7496908
## 13 -0.01748769 -1.15753651 0.60970025 1.16305930 -0.22062731 1.0327065
## 14 0.71335595 -1.14783543 0.46421474 1.70179171 0.14326326 -1.0408879
## 15 -0.37919602 -1.16207650 1.22391159 0.01205071 -0.85406964 1.4705918
## epsilon_true Y
## 1 0.7378187 0.100771865
## 2 0.4483395 0.114447778
## 3 1.0208067 1.495383611
## 4 -0.1378989 0.132438629
## 5 0.2103863 0.824991735
## 6 -0.6428271 -0.837746058
## 7 1.3932008 0.651291630
## 8 -0.8195051 0.841108099
## 9 0.5705848 1.098586689
## 10 1.0240772 0.009464289
## 11 -0.4327152 -1.216381066
## 12 0.4384483 2.188139075
## 13 -1.4106443 -0.377937751
## 14 0.9016948 -0.139193056
## 15 1.0030439 2.473635682
Remember that although our dataset has ten predictors in it, only the first three were actually used to generate the Y variable, so the other seven are structurally unrelated to the response. That doesn’t mean, however, that there are no correlations between them and the response in the 15 data points we’ve generated. Let’s look at the correlations between each pair of variables.
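The chunk that computed the correlation matrix below isn’t echoed here; one way to reproduce something like it (a sketch using dplyr’s select() and base R’s cor(), keeping just the ten predictors and the response) would be:
## pairwise correlations among the ten predictors and Y, rounded for display
FakeData %>%
  select(X1:X10, Y) %>%
  cor() %>%
  round(3)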
## X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
## X1 1.000 -0.101 -0.268 -0.180 0.410 -0.048 -0.162 0.138 0.404 -0.061
## X2 -0.101 1.000 0.233 -0.105 -0.277 -0.231 0.366 0.108 0.159 0.024
## X3 -0.268 0.233 1.000 -0.310 0.174 -0.090 0.103 0.053 0.214 0.282
## X4 -0.180 -0.105 -0.310 1.000 -0.376 0.317 0.481 -0.476 -0.153 -0.116
## X5 0.410 -0.277 0.174 -0.376 1.000 -0.161 -0.363 0.680 0.344 0.234
## X6 -0.048 -0.231 -0.090 0.317 -0.161 1.000 0.022 -0.517 -0.099 0.130
## X7 -0.162 0.366 0.103 0.481 -0.363 0.022 1.000 -0.463 -0.281 -0.110
## X8 0.138 0.108 0.053 -0.476 0.680 -0.517 -0.463 1.000 0.352 0.125
## X9 0.404 0.159 0.214 -0.153 0.344 -0.099 -0.281 0.352 1.000 0.225
## X10 -0.061 0.024 0.282 -0.116 0.234 0.130 -0.110 0.125 0.225 1.000
## Y -0.643 -0.039 0.391 0.189 0.141 -0.078 0.074 0.191 -0.203 0.053
## Y
## X1 -0.643
## X2 -0.039
## X3 0.391
## X4 0.189
## X5 0.141
## X6 -0.078
## X7 0.074
## X8 0.191
## X9 -0.203
## X10 0.053
## Y 1.000
The following code will fit eleven models to our data. The first will use no predictors; the second will use just X1; the third will use just X1 and X2; etc. In reality we wouldn’t necessarily have an obvious ordering to the candidate predictors like this, but this simulation will illustrate what happens as we first add useful predictors (in terms of the population model) to the model, and then start adding useless predictors (in terms of the population model).
The first few models are underfitting the data; they use too few predictors.
m0 <- lm(Y ~ 1, data = FakeData)
m1 <- lm(Y ~ X1, data = FakeData)
m2 <- lm(Y ~ X1 + X2, data = FakeData)
The next one has the correct functional form (though, as always, it won’t have exactly the correct coefficients, since it’s fit on a specific dataset with its own idiosyncrasies).
m3 <- lm(Y ~ X1 + X2 + X3, data = FakeData)
The remaining models are overfitting the data; they include too many predictors, and are starting to fit the idiosyncrasies/noise as if it were part of the systematic pattern.
m4 <- lm(Y ~ X1 + X2 + X3 + X4, data = FakeData)
m5 <- lm(Y ~ X1 + X2 + X3 + X4 + X5, data = FakeData)
m6 <- lm(Y ~ X1 + X2 + X3 + X4 + X5 + X6, data = FakeData)
m7 <- lm(Y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7, data = FakeData)
m8 <- lm(Y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8, data = FakeData)
m9 <- lm(Y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9, data = FakeData)
m10 <- lm(Y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9 + X10, data = FakeData)
Let’s calculate the \(R^2\) values for each of these models. The following code will calculate all eleven at once and collect them into a data frame.
sum_squared_values <- function(values) { sum(values^2) }
sum_squared_residuals <- function(model) { sum_squared_values(residuals(model)) }
model_list <- list(m0, m1, m2, m3, m4, m5, m6, m7, m8, m9, m10)
fit_measures <- tibble(
model_number = 0:10,
sum_squared_residuals = sapply(model_list, sum_squared_residuals),
r_squared = sapply(model_list, rsquared))
Because we know the true population model, we can also calculate the proportion of variability in the Y variable that can be accounted for using the true generating process. This will give us a measure of how much “signal” about the population process is potentially available to be extracted from this data if we had the perfect model.
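As an aside (this formula isn’t in the original notes, but it may help make the idea concrete): because the \(\varepsilon_i\)s are independent of the predictors, the best achievable \(R^2\) in the population is
\[ R^2_{\max} = \frac{\operatorname{Var}(\hat{Y}^{\text{true}})}{\operatorname{Var}(\hat{Y}^{\text{true}}) + \sigma_\varepsilon^2}, \]
with \(\sigma_\varepsilon = 1\) here. The chunk below estimates this ceiling from our sample via the squared correlation between \(Y\) and the true \(\hat{Y}\) values, and also records the sum of squared true residuals for comparison with the fitted models.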
true_sum_squared_residuals <- sum(~(epsilon_true^2), data = FakeData)
true_rsquared <- cor(Y ~ Y_hat_true, data = FakeData)^2
Let’s plot the fitted sums of squared residuals and \(R^2\) values for our eleven models as a function of how many predictors they use (with the number that are actually relevant for the population indicated by a vertical dotted line), and compare to the \(R^2\) we’d obtain using the optimal model (represented by the horizontal dashed line).
gf_point(sum_squared_residuals ~ model_number, data = fit_measures) +
scale_x_continuous(name = "Number of Predictors Used", breaks = 0:10) +
scale_y_continuous("Sum of Squared Residuals") +
geom_hline(yintercept = true_sum_squared_residuals, linetype = 2) +
geom_vline(xintercept = 3, linetype = 3)
gf_point(r_squared ~ model_number, data = fit_measures) +
scale_x_continuous(name = "Number of Predictors Used", breaks = 0:10) +
scale_y_continuous(expression(R^2), limits = c(0,1), breaks = seq(0,1,0.1)) +
geom_hline(yintercept = true_rsquared, linetype = 2) +
geom_vline(xintercept = 3, linetype = 3)
Does it actually matter in practice if we use a model that includes irrelevant predictors? And is this artificial example even meaningful? In the real world, most things are at least somewhat correlated with most other things, so is there really such a thing as a completely useless predictor?
There are a couple of reasons why using models that include unnecessary or only marginally relevant predictors might not be desirable.
For one, adding unnecessary or barely useful complexity makes our models more difficult to interpret.
But even if all we care about is prediction rather than interpretability, there is another reason to avoid irrelevant or very weakly relevant predictors: including them can actually cause our prediction performance to degrade.
But wait, didn’t we just see that \(R^2\), which captures our model’s ability to make predictions close to the response values, keeps going up when we add predictors?
Well, yes, as long as we’re measuring it on the data we used to fit the model.
There’s no guarantee that the same thing will happen if we use the model to make predictions for new data points.
Let’s generate a second dataset using the same predictor values and same true population model (so the true “Y hat” values are the same too). The only thing that will change is the residuals, and therefore also the sample response values.
set.seed(29747)
TestData <- Predictors %>%
mutate(
Y_hat_true = beta1 * X1 + beta2 * X2 + beta3 * X3,
epsilon_true = rnorm(N, mean = 0, sd = 1),
Y = Y_hat_true + epsilon_true)
Here’s what our “test” data looks like.
## X1 X2 X3 X4 X5
## 1 -0.06723632 0.344367097 -1.260985237 1.1394641 -1.22178192
## 2 0.48352186 -0.003025539 -0.008930402 0.5933576 -0.09920208
## 3 0.14027895 0.662125963 1.131039665 -0.4751120 0.85241411
## 4 -0.14679568 1.183290044 0.419156068 -0.7710384 0.71452770
## 5 0.23099272 0.209720171 1.474832991 2.0802482 0.14841417
## 6 -0.12422362 -0.019105983 -0.529369588 0.8234882 -0.09490660
## 7 0.72906491 -1.308636074 -0.571149276 0.7641005 0.92692699
## 8 -2.45001626 0.564470520 0.031525945 1.8675152 -2.43932045
## 9 -0.58778361 -1.039416308 0.155512893 -0.1596948 1.85649268
## 10 0.80465245 0.543004777 -0.836353314 -0.2528785 1.40328134
## 11 0.71583781 -0.965897366 -0.638773126 -1.2241052 -0.50995957
## 12 -1.62122610 -0.938844874 1.141673229 0.5780161 0.61579979
## 13 0.10937685 0.736209672 2.151278838 -2.1621296 0.03722538
## 14 0.92418032 0.133799711 -0.765966795 1.5728608 -0.30395797
## 15 -1.88670515 0.477533977 0.389948104 -0.1341345 -0.47311218
## X6 X7 X8 X9 X10 Y_hat_true
## 1 1.57331589 0.07347787 -1.17511509 -1.58826190 -0.74738073 -0.6370468
## 2 -0.23803425 0.04778266 0.29651274 -0.83380992 -1.37397000 -0.3338917
## 3 -0.75151885 -0.17566620 0.75196443 0.23721090 1.26749634 0.4745769
## 4 0.10010522 1.03740917 0.84239825 0.09036045 0.01972571 0.2703376
## 5 1.71852758 1.03921216 -1.18160387 1.02843119 0.04647717 0.6146054
## 6 -0.41965428 -0.77916316 0.04815703 0.21454715 0.18274507 -0.1949189
## 7 -0.55190791 0.20261885 0.43240423 0.18691659 -0.92736819 -0.7419092
## 8 0.58528027 1.24595327 -1.07168921 -0.34279289 0.04284334 1.6606132
## 9 1.30870356 -1.62019637 1.68940879 -0.04145376 1.05665648 0.5280019
## 10 0.24342600 -1.08617858 1.41389620 1.55033432 -0.72895214 -1.0146129
## 11 1.14655152 -1.33544032 -0.95273673 0.17503179 0.01635606 -0.7836659
## 12 0.95300884 -0.87182069 -0.55269482 -0.31294268 -0.31854207 1.7496908
## 13 -0.01748769 -1.15753651 0.60970025 1.16305930 -0.22062731 1.0327065
## 14 0.71335595 -1.14783543 0.46421474 1.70179171 0.14326326 -1.0408879
## 15 -0.37919602 -1.16207650 1.22391159 0.01205071 -0.85406964 1.4705918
## epsilon_true Y
## 1 0.06331548 -0.5737313
## 2 1.74351271 1.4096210
## 3 -0.28378807 0.1907889
## 4 -0.71347644 -0.4431389
## 5 0.23891517 0.8535206
## 6 1.87388159 1.6789627
## 7 -0.45193378 -1.1938429
## 8 0.97842021 2.6390334
## 9 -1.36031967 -0.8323178
## 10 -1.82082452 -2.8354374
## 11 0.02899077 -0.7546751
## 12 -1.25768082 0.4920099
## 13 0.84769610 1.8804026
## 14 0.07001422 -0.9708736
## 15 -1.43786452 0.0327273
Let’s see how good a job each of the models we fit on the first dataset can do at predicting the Y values in the second dataset. The following code will generate predicted values for each of the eleven models using the new test dataset.
predictions_by_model <- sapply(model_list, predict, newdata = TestData)
colnames(predictions_by_model) <- paste0("m",0:10)
Now let’s compare those predictions to the actual Y values in the test data.
generalization_errors_by_model <- with(TestData, Y - predictions_by_model) %>% as_tibble()
generalization_sse_by_model <- generalization_errors_by_model %>%
summarize_all(sum_squared_values)
Here are the sums of squared errors each model obtains when its coefficients are estimated from the original dataset and its predictions are made for the test dataset.
## # A tibble: 1 x 11
## m0 m1 m2 m3 m4 m5 m6 m7 m8 m9 m10
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 30.6 23.7 24.6 21.9 20.7 32.1 31.1 31.2 31.3 30.5 31.3
And here’s a graph of the same:
library(olsrr)
fit_measures <- fit_measures %>%
mutate(
generalization_sse = t(generalization_sse_by_model),
generalization_rsquared = sapply(model_list, ols_pred_rsq)
)
gf_point(generalization_sse ~ model_number, data = fit_measures) +
scale_x_continuous(name = "Number of Predictors Used", breaks = 0:10) +
scale_y_continuous("Sum of Squared Residuals") +
geom_hline(yintercept = true_sum_squared_residuals, linetype = 2) +
geom_vline(xintercept = 3, linetype = 3)