How can we decide what the “best” model is from a set of candidates? We have seen some methods already (for example, comparing nested models).
But what if we have more than a couple of models that reasonably satisfy the regression conditions, or we have candidate models that aren’t nested?
Sometimes it can be useful to generate simulated data, where we know exactly what the “correct” model/population looks like, so that we can see how our various statistical metrics and procedures perform.
In the following simulation, we’ll create a dataset sampled from a known population process, which is characterized by the following model:
\[ Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \beta_3 X_{i3} + \varepsilon_i \]
where the residuals (the \(\varepsilon_i\)s) are independent of each other and are Normally distributed with a mean of zero and a standard deviation of 1. So there are exactly three relevant predictors: \(X_1\), \(X_2\), and \(X_3\).
That is, we’re creating a population in which there is a “true” model that satisfies our standard regression conditions.
The following code will generate a dataset of a specified size (here, 15 data points) which includes the three relevant predictors as well as seven more irrelevant predictors.
The set.seed() command “anchors” the random number generator so that if you run the chunk multiple times with the same “seed” you will get the same dataset, but if you change the seed you’ll get a different dataset. Change the number inside set.seed() so you get a different dataset from other groups.
set.seed(4321)
## N = number of data points to generate
## K = number of potential predictors to generate
N <- 15; K <- 10
## Generate random predictors (from a Normal distribution)
Predictors <- do(N) * rnorm(K, mean = 0, sd = 1)
## Name them X1, ..., XK (here K = 10)
names(Predictors) <- paste0("X", 1:K)
Here’s what the dataset looks like so far:
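The chunk that produced this display isn’t echoed in these notes; since the data frame has only 15 rows, simply printing it (a minimal sketch using the Predictors object created above) shows the whole thing:
## print the full (small) data frame of generated predictors
Predictors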
## X1 X2 X3 X4 X5
## 1 -0.06723632 0.344367097 -1.260985237 1.1394641 -1.22178192
## 2 0.48352186 -0.003025539 -0.008930402 0.5933576 -0.09920208
## 3 0.14027895 0.662125963 1.131039665 -0.4751120 0.85241411
## 4 -0.14679568 1.183290044 0.419156068 -0.7710384 0.71452770
## 5 0.23099272 0.209720171 1.474832991 2.0802482 0.14841417
## 6 -0.12422362 -0.019105983 -0.529369588 0.8234882 -0.09490660
## 7 0.72906491 -1.308636074 -0.571149276 0.7641005 0.92692699
## 8 -2.45001626 0.564470520 0.031525945 1.8675152 -2.43932045
## 9 -0.58778361 -1.039416308 0.155512893 -0.1596948 1.85649268
## 10 0.80465245 0.543004777 -0.836353314 -0.2528785 1.40328134
## 11 0.71583781 -0.965897366 -0.638773126 -1.2241052 -0.50995957
## 12 -1.62122610 -0.938844874 1.141673229 0.5780161 0.61579979
## 13 0.10937685 0.736209672 2.151278838 -2.1621296 0.03722538
## 14 0.92418032 0.133799711 -0.765966795 1.5728608 -0.30395797
## 15 -1.88670515 0.477533977 0.389948104 -0.1341345 -0.47311218
## X6 X7 X8 X9 X10
## 1 1.57331589 0.07347787 -1.17511509 -1.58826190 -0.74738073
## 2 -0.23803425 0.04778266 0.29651274 -0.83380992 -1.37397000
## 3 -0.75151885 -0.17566620 0.75196443 0.23721090 1.26749634
## 4 0.10010522 1.03740917 0.84239825 0.09036045 0.01972571
## 5 1.71852758 1.03921216 -1.18160387 1.02843119 0.04647717
## 6 -0.41965428 -0.77916316 0.04815703 0.21454715 0.18274507
## 7 -0.55190791 0.20261885 0.43240423 0.18691659 -0.92736819
## 8 0.58528027 1.24595327 -1.07168921 -0.34279289 0.04284334
## 9 1.30870356 -1.62019637 1.68940879 -0.04145376 1.05665648
## 10 0.24342600 -1.08617858 1.41389620 1.55033432 -0.72895214
## 11 1.14655152 -1.33544032 -0.95273673 0.17503179 0.01635606
## 12 0.95300884 -0.87182069 -0.55269482 -0.31294268 -0.31854207
## 13 -0.01748769 -1.15753651 0.60970025 1.16305930 -0.22062731
## 14 0.71335595 -1.14783543 0.46421474 1.70179171 0.14326326
## 15 -0.37919602 -1.16207650 1.22391159 0.01205071 -0.85406964
Next we’ll generate the true coefficients for the population model.
set.seed(8675309)
## set the true beta0 to 0
beta0 <- 0
## generate the betas for the "useful" predictors randomly between -1 and 1
beta1 <- runif(1, min = -1, max = 1)
beta2 <- runif(1, min = -1, max = 1)
beta3 <- runif(1, min = -1, max = 1)
Here are the values for the coefficients (other than the intercept, which we set to zero) that we got:
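The chunk that printed these values isn’t echoed here; a minimal sketch that produces similar lines (rounding to two decimal places, using the beta objects defined above) would be:
## report the randomly generated coefficients, rounded for display
print(paste0("beta1 = ", round(beta1, 2)))
print(paste0("beta2 = ", round(beta2, 2)))
print(paste0("beta3 = ", round(beta3, 2)))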
## [1] "beta1 = -0.68"
## [1] "beta2 = -0.04"
## [1] "beta3 = 0.53"
Now that we have the true population model, let’s generate values of the response variable for our fake dataset.
FakeData <- Predictors %>%
mutate(
Y_hat_true = beta1 * X1 + beta2 * X2 + beta3 * X3,
epsilon_true = rnorm(N, mean = 0, sd = 1),
Y = Y_hat_true + epsilon_true)
We wouldn’t normally have the “true” \(\hat{Y}\) values or the true \(\varepsilon\) values of course, but since this is a simulation we can keep track of those for the purposes of evaluating the performance of our procedures.
Here’s what our full dataset looks like.
## X1 X2 X3 X4 X5
## 1 -0.06723632 0.344367097 -1.260985237 1.1394641 -1.22178192
## 2 0.48352186 -0.003025539 -0.008930402 0.5933576 -0.09920208
## 3 0.14027895 0.662125963 1.131039665 -0.4751120 0.85241411
## 4 -0.14679568 1.183290044 0.419156068 -0.7710384 0.71452770
## 5 0.23099272 0.209720171 1.474832991 2.0802482 0.14841417
## 6 -0.12422362 -0.019105983 -0.529369588 0.8234882 -0.09490660
## 7 0.72906491 -1.308636074 -0.571149276 0.7641005 0.92692699
## 8 -2.45001626 0.564470520 0.031525945 1.8675152 -2.43932045
## 9 -0.58778361 -1.039416308 0.155512893 -0.1596948 1.85649268
## 10 0.80465245 0.543004777 -0.836353314 -0.2528785 1.40328134
## 11 0.71583781 -0.965897366 -0.638773126 -1.2241052 -0.50995957
## 12 -1.62122610 -0.938844874 1.141673229 0.5780161 0.61579979
## 13 0.10937685 0.736209672 2.151278838 -2.1621296 0.03722538
## 14 0.92418032 0.133799711 -0.765966795 1.5728608 -0.30395797
## 15 -1.88670515 0.477533977 0.389948104 -0.1341345 -0.47311218
## X6 X7 X8 X9 X10 Y_hat_true
## 1 1.57331589 0.07347787 -1.17511509 -1.58826190 -0.74738073 -0.6370468
## 2 -0.23803425 0.04778266 0.29651274 -0.83380992 -1.37397000 -0.3338917
## 3 -0.75151885 -0.17566620 0.75196443 0.23721090 1.26749634 0.4745769
## 4 0.10010522 1.03740917 0.84239825 0.09036045 0.01972571 0.2703376
## 5 1.71852758 1.03921216 -1.18160387 1.02843119 0.04647717 0.6146054
## 6 -0.41965428 -0.77916316 0.04815703 0.21454715 0.18274507 -0.1949189
## 7 -0.55190791 0.20261885 0.43240423 0.18691659 -0.92736819 -0.7419092
## 8 0.58528027 1.24595327 -1.07168921 -0.34279289 0.04284334 1.6606132
## 9 1.30870356 -1.62019637 1.68940879 -0.04145376 1.05665648 0.5280019
## 10 0.24342600 -1.08617858 1.41389620 1.55033432 -0.72895214 -1.0146129
## 11 1.14655152 -1.33544032 -0.95273673 0.17503179 0.01635606 -0.7836659
## 12 0.95300884 -0.87182069 -0.55269482 -0.31294268 -0.31854207 1.7496908
## 13 -0.01748769 -1.15753651 0.60970025 1.16305930 -0.22062731 1.0327065
## 14 0.71335595 -1.14783543 0.46421474 1.70179171 0.14326326 -1.0408879
## 15 -0.37919602 -1.16207650 1.22391159 0.01205071 -0.85406964 1.4705918
## epsilon_true Y
## 1 0.7378187 0.100771865
## 2 0.4483395 0.114447778
## 3 1.0208067 1.495383611
## 4 -0.1378989 0.132438629
## 5 0.2103863 0.824991735
## 6 -0.6428271 -0.837746058
## 7 1.3932008 0.651291630
## 8 -0.8195051 0.841108099
## 9 0.5705848 1.098586689
## 10 1.0240772 0.009464289
## 11 -0.4327152 -1.216381066
## 12 0.4384483 2.188139075
## 13 -1.4106443 -0.377937751
## 14 0.9016948 -0.139193056
## 15 1.0030439 2.473635682
Remember that although our dataset has ten predictors in it, only the first three were actually used to generate the Y variable, so the other seven are structurally unrelated to the response. That doesn’t mean, however, that there are no correlations between them and the response in the 15 data points we’ve generated. Let’s look at the correlations between each pair of variables.
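The chunk that computed the correlation matrix below isn’t echoed here; one way to reproduce something like it (a sketch using dplyr’s select() and base R’s cor(), keeping just the ten predictors and the response) would be:
## pairwise correlations among the ten predictors and Y, rounded for display
FakeData %>%
  select(X1:X10, Y) %>%
  cor() %>%
  round(3)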
## X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
## X1 1.000 -0.101 -0.268 -0.180 0.410 -0.048 -0.162 0.138 0.404 -0.061
## X2 -0.101 1.000 0.233 -0.105 -0.277 -0.231 0.366 0.108 0.159 0.024
## X3 -0.268 0.233 1.000 -0.310 0.174 -0.090 0.103 0.053 0.214 0.282
## X4 -0.180 -0.105 -0.310 1.000 -0.376 0.317 0.481 -0.476 -0.153 -0.116
## X5 0.410 -0.277 0.174 -0.376 1.000 -0.161 -0.363 0.680 0.344 0.234
## X6 -0.048 -0.231 -0.090 0.317 -0.161 1.000 0.022 -0.517 -0.099 0.130
## X7 -0.162 0.366 0.103 0.481 -0.363 0.022 1.000 -0.463 -0.281 -0.110
## X8 0.138 0.108 0.053 -0.476 0.680 -0.517 -0.463 1.000 0.352 0.125
## X9 0.404 0.159 0.214 -0.153 0.344 -0.099 -0.281 0.352 1.000 0.225
## X10 -0.061 0.024 0.282 -0.116 0.234 0.130 -0.110 0.125 0.225 1.000
## Y -0.643 -0.039 0.391 0.189 0.141 -0.078 0.074 0.191 -0.203 0.053
## Y
## X1 -0.643
## X2 -0.039
## X3 0.391
## X4 0.189
## X5 0.141
## X6 -0.078
## X7 0.074
## X8 0.191
## X9 -0.203
## X10 0.053
## Y 1.000
The following code will fit eleven models to our data. The first will use no predictors; the second will use just X1; the third will use just X1 and X2; etc. In reality we wouldn’t necessarily have an obvious ordering to the candidate predictors like this, but this simulation will illustrate what happens as we first add useful predictors (in terms of the population model) to the model, and then start adding useless predictors (in terms of the population model).
The first few models are underfitting the data; they use too few predictors.
m0 <- lm(Y ~ 1, data = FakeData)
m1 <- lm(Y ~ X1, data = FakeData)
m2 <- lm(Y ~ X1 + X2, data = FakeData)
The next one has the correct functional form (though, as always, it won’t have exactly the correct coefficients, since it’s fit on a specific dataset with its own idiosyncrasies).
m3 <- lm(Y ~ X1 + X2 + X3, data = FakeData)
The remaining models are overfitting the data; they include too many predictors, and are starting to fit the idiosyncrasies/noise as if it were part of the systematic pattern.
m4 <- lm(Y ~ X1 + X2 + X3 + X4, data = FakeData)
m5 <- lm(Y ~ X1 + X2 + X3 + X4 + X5, data = FakeData)
m6 <- lm(Y ~ X1 + X2 + X3 + X4 + X5 + X6, data = FakeData)
m7 <- lm(Y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7, data = FakeData)
m8 <- lm(Y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8, data = FakeData)
m9 <- lm(Y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9, data = FakeData)
m10 <- lm(Y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9 + X10, data = FakeData)
Let’s calculate the \(R^2\) values for each of these models. The following code will calculate all eleven at once and collect them into a data frame.
sum_squared_values <- function(values) { sum(values^2) }
sum_squared_residuals <- function(model) { sum_squared_values(residuals(model)) }
model_list <- list(m0, m1, m2, m3, m4, m5, m6, m7, m8, m9, m10)
fit_measures <- tibble(
model_number = 0:10,
sum_squared_residuals = sapply(model_list, sum_squared_residuals),
r_squared = sapply(model_list, rsquared))
Because we know the true population model, we can also calculate the proportion of variability in the Y variable that can be accounted for using the true generating process. This will give us a measure of how much “signal” about the population process is potentially available to be extracted from this data if we had the perfect model.
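As an aside (this formula isn’t in the original notes, but it may help make the idea concrete): because the \(\varepsilon_i\)s are independent of the predictors, the best achievable \(R^2\) in the population is
\[ R^2_{\max} = \frac{\operatorname{Var}(\hat{Y}^{\text{true}})}{\operatorname{Var}(\hat{Y}^{\text{true}}) + \sigma_\varepsilon^2}, \]
with \(\sigma_\varepsilon = 1\) here. The chunk below estimates this ceiling from our sample via the squared correlation between \(Y\) and the true \(\hat{Y}\) values, and also records the sum of squared true residuals for comparison with the fitted models.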
true_sum_squared_residuals <- sum(~(epsilon_true^2), data = FakeData)
true_rsquared <- cor(Y ~ Y_hat_true, data = FakeData)^2
Let’s plot the fitted sums of squared residuals and \(R^2\) values for our eleven models as a function of how many predictors they use (with the number that are actually relevant for the population indicated by a vertical dotted line), and compare to the \(R^2\) we’d obtain using the optimal model (represented by the horizontal dashed line).
gf_point(sum_squared_residuals ~ model_number, data = fit_measures) +
scale_x_continuous(name = "Number of Predictors Used", breaks = 0:10) +
scale_y_continuous("Sum of Squared Residuals") +
geom_hline(yintercept = true_sum_squared_residuals, linetype = 2) +
geom_vline(xintercept = 3, linetype = 3)
gf_point(r_squared ~ model_number, data = fit_measures) +
scale_x_continuous(name = "Number of Predictors Used", breaks = 0:10) +
scale_y_continuous(expression(R^2), limits = c(0,1), breaks = seq(0,1,0.1)) +
geom_hline(yintercept = true_rsquared, linetype = 2) +
geom_vline(xintercept = 3, linetype = 3)
Does it actually matter in practice if we use a model that includes irrelevant predictors? And is this artificial example even meaningful? In the real world, most things are at least somewhat correlated with most other things, so is there really such a thing as a completely useless predictor?
There are a couple of reasons why using models that include unnecessary or only marginally relevant predictors might not be desirable.
For one, adding unnecessary or barely useful complexity makes our models more difficult to interpret.
But even if all we care about is prediction rather than interpretability, there is another reason to avoid irrelevant or very weakly relevant predictors: including them can actually cause our prediction performance to degrade.
But wait, didn’t we just see that \(R^2\), which captures our model’s ability to make predictions close to the response values, keeps going up when we add predictors?
Well, yes, as long as we’re measuring it on the data we used to fit the model.
There’s no guarantee that the same thing will happen if we use the model to make predictions for new data points.
Let’s generate a second dataset using the same predictor values and same true population model (so the true “Y hat” values are the same too). The only thing that will change is the residuals, and therefore also the sample response values.
set.seed(29747)
TestData <- Predictors %>%
mutate(
Y_hat_true = beta1 * X1 + beta2 * X2 + beta3 * X3,
epsilon_true = rnorm(N, mean = 0, sd = 1),
Y = Y_hat_true + epsilon_true)
Here’s what our “test” data looks like.
## X1 X2 X3 X4 X5
## 1 -0.06723632 0.344367097 -1.260985237 1.1394641 -1.22178192
## 2 0.48352186 -0.003025539 -0.008930402 0.5933576 -0.09920208
## 3 0.14027895 0.662125963 1.131039665 -0.4751120 0.85241411
## 4 -0.14679568 1.183290044 0.419156068 -0.7710384 0.71452770
## 5 0.23099272 0.209720171 1.474832991 2.0802482 0.14841417
## 6 -0.12422362 -0.019105983 -0.529369588 0.8234882 -0.09490660
## 7 0.72906491 -1.308636074 -0.571149276 0.7641005 0.92692699
## 8 -2.45001626 0.564470520 0.031525945 1.8675152 -2.43932045
## 9 -0.58778361 -1.039416308 0.155512893 -0.1596948 1.85649268
## 10 0.80465245 0.543004777 -0.836353314 -0.2528785 1.40328134
## 11 0.71583781 -0.965897366 -0.638773126 -1.2241052 -0.50995957
## 12 -1.62122610 -0.938844874 1.141673229 0.5780161 0.61579979
## 13 0.10937685 0.736209672 2.151278838 -2.1621296 0.03722538
## 14 0.92418032 0.133799711 -0.765966795 1.5728608 -0.30395797
## 15 -1.88670515 0.477533977 0.389948104 -0.1341345 -0.47311218
## X6 X7 X8 X9 X10 Y_hat_true
## 1 1.57331589 0.07347787 -1.17511509 -1.58826190 -0.74738073 -0.6370468
## 2 -0.23803425 0.04778266 0.29651274 -0.83380992 -1.37397000 -0.3338917
## 3 -0.75151885 -0.17566620 0.75196443 0.23721090 1.26749634 0.4745769
## 4 0.10010522 1.03740917 0.84239825 0.09036045 0.01972571 0.2703376
## 5 1.71852758 1.03921216 -1.18160387 1.02843119 0.04647717 0.6146054
## 6 -0.41965428 -0.77916316 0.04815703 0.21454715 0.18274507 -0.1949189
## 7 -0.55190791 0.20261885 0.43240423 0.18691659 -0.92736819 -0.7419092
## 8 0.58528027 1.24595327 -1.07168921 -0.34279289 0.04284334 1.6606132
## 9 1.30870356 -1.62019637 1.68940879 -0.04145376 1.05665648 0.5280019
## 10 0.24342600 -1.08617858 1.41389620 1.55033432 -0.72895214 -1.0146129
## 11 1.14655152 -1.33544032 -0.95273673 0.17503179 0.01635606 -0.7836659
## 12 0.95300884 -0.87182069 -0.55269482 -0.31294268 -0.31854207 1.7496908
## 13 -0.01748769 -1.15753651 0.60970025 1.16305930 -0.22062731 1.0327065
## 14 0.71335595 -1.14783543 0.46421474 1.70179171 0.14326326 -1.0408879
## 15 -0.37919602 -1.16207650 1.22391159 0.01205071 -0.85406964 1.4705918
## epsilon_true Y
## 1 0.06331548 -0.5737313
## 2 1.74351271 1.4096210
## 3 -0.28378807 0.1907889
## 4 -0.71347644 -0.4431389
## 5 0.23891517 0.8535206
## 6 1.87388159 1.6789627
## 7 -0.45193378 -1.1938429
## 8 0.97842021 2.6390334
## 9 -1.36031967 -0.8323178
## 10 -1.82082452 -2.8354374
## 11 0.02899077 -0.7546751
## 12 -1.25768082 0.4920099
## 13 0.84769610 1.8804026
## 14 0.07001422 -0.9708736
## 15 -1.43786452 0.0327273
Let’s see how good a job each of the models we fit on the first dataset can do at predicting the Y values in the second dataset. The following code will generate predicted values for each of the eleven models using the new test dataset.
predictions_by_model <- sapply(model_list, predict, newdata = TestData)
colnames(predictions_by_model) <- paste0("m",0:10)
Now let’s compare those predictions to the actual Y values in the test data.
generalization_errors_by_model <- with(TestData, Y - predictions_by_model) %>% as_tibble()
generalization_sse_by_model <- generalization_errors_by_model %>%
summarize_all(sum_squared_values)
Here are the sums of squared errors each model obtains when its coefficients are estimated from the original dataset and its predictions are made for the test dataset.
## # A tibble: 1 x 11
## m0 m1 m2 m3 m4 m5 m6 m7 m8 m9 m10
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 30.6 23.7 24.6 21.9 20.7 32.1 31.1 31.2 31.3 30.5 31.3
And here’s a graph of the same:
library(olsrr)
fit_measures <- fit_measures %>%
mutate(
generalization_sse = t(generalization_sse_by_model),
generalization_rsquared = sapply(model_list, ols_pred_rsq)
)
gf_point(generalization_sse ~ model_number, data = fit_measures) +
scale_x_continuous(name = "Number of Predictors Used", breaks = 0:10) +
scale_y_continuous("Sum of Squared Residuals") +
geom_hline(yintercept = true_sum_squared_residuals, linetype = 2) +
geom_vline(xintercept = 3, linetype = 3)