Generating Synthetic Data From a Known Population Process

First we will generate synthetic data, again from a known model, so we have something to judge the performance of our metrics against.

The ‘ground truth’ model is the following cubic polynomial:

\[Y_i = 1 + 0.5 X_i - 0.5 X_i^2 + 0.25 X_i^3 + \varepsilon_i\]

where \(X\) is a single predictor variable and the errors \(\varepsilon_i\) satisfy the usual regression assumptions:

\[\varepsilon_i \sim N(0, 4)\]

The following script will generate the data as FakePolyData and plot the true function through the noisy data.
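A sketch of what such a script might look like is below (the seed, sample size, and range of \(X\) values are illustrative assumptions, and \(N(0, 4)\) is read as a variance of 4, i.e. a standard deviation of 2):

```r
library(ggplot2)

# True generating function: Y = 1 + 0.5X - 0.5X^2 + 0.25X^3 + noise
true_fn <- function(x) 1 + 0.5 * x - 0.5 * x^2 + 0.25 * x^3

set.seed(42)                                   # illustrative seed
n <- 12                                        # the demo dataset has 12 points
x <- runif(n, min = -3, max = 3)               # assumed range for the predictor
y <- true_fn(x) + rnorm(n, mean = 0, sd = 2)   # N(0, 4) noise, i.e. sd = 2

FakePolyData <- data.frame(x = x, y = y)

# Plot the noisy data with the true cubic drawn through it
ggplot(FakePolyData, aes(x = x, y = y)) +
  geom_point() +
  stat_function(fun = true_fn, colour = "blue")
```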

Fitting Polynomial Models to the Generated Data

Now we’ll fit a series of nine polynomial models to this data, ranging from degree 1 to degree 9.

The code below uses a function called fit_polynomial_model() that I wrote for this demo (the source code is in the helper_functions.R script that we loaded with source() above).
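The real definition lives in helper_functions.R; the version below is only a sketch of what a function like this might look like, followed by per-degree calls in the spirit of what the demo does (object names such as poly_model_1 are illustrative):

```r
# Sketch of fit_polynomial_model(); the actual helper in helper_functions.R may differ
fit_polynomial_model <- function(degree, linear_formula, data) {
  # Take the degree-1 formula (e.g. y ~ x) and swap in a raw polynomial
  # of the requested degree on the right-hand side
  response  <- all.vars(linear_formula)[1]
  predictor <- all.vars(linear_formula)[2]
  poly_formula <- as.formula(
    paste0(response, " ~ poly(", predictor, ", degree = ", degree, ", raw = TRUE)")
  )
  lm(poly_formula, data = data)
}

poly_model_1 <- fit_polynomial_model(1, linear_formula = y ~ x, data = FakePolyData)
poly_model_2 <- fit_polynomial_model(2, linear_formula = y ~ x, data = FakePolyData)
poly_model_3 <- fit_polynomial_model(3, linear_formula = y ~ x, data = FakePolyData)
# ... and so on, up to the degree-9 model
```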

Let’s look at the coefficients from one of the models:
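A call like this would pull them out of the degree-3 fit (using the illustrative object name from the sketch above):

```r
# Coefficients of the fitted cubic, rounded to two decimal places
round(coef(poly_model_3), 2)
```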

##                      (Intercept) poly(x, degree = 3, raw = TRUE)1 
##                             0.96                             2.70 
## poly(x, degree = 3, raw = TRUE)2 poly(x, degree = 3, raw = TRUE)3 
##                            -0.59                             0.13

Recall that the ‘ground truth’ model is the cubic polynomial:

\[Y_i = 1 + 0.5 X_i - 0.5 X_i^2 + 0.25 X_i^3 + \varepsilon_i\]

The fitted cubic model is:

\[\hat{Y}_i = 0.96 + 2.70 X_i - 0.59 X_i^2 + 0.13 X_i^3\]

which looks like this, plotted against the data:

Aside: We could also have done this in one line using R’s handy lapply() function:
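A one-liner along these lines would do it (poly_models is an illustrative name for the resulting list):

```r
# Fit all nine polynomial models (degrees 1 through 9) in a single call
poly_models <- lapply(1:9, fit_polynomial_model,
                      linear_formula = y ~ x, data = FakePolyData)
```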

(the code above calls fit_polynomial_model() with its first argument set to 1, then 2, then 3, and so on up to 9, keeping the linear_formula= and data= arguments constant, and returns the results of all nine calls in a list)

Let’s plot the nine fitted model curves on the data. The function plot_data_with_polynomial_model() is also a custom one I wrote for this demo, and is defined in helper_functions.R. Again, for conciseness I’m calling the function nine times all at once using lapply(), but we could just as well call it one model at a time.
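That call might look something like this (the signature of plot_data_with_polynomial_model() is an assumption here; check helper_functions.R for the real one):

```r
# Assumed signature: plot_data_with_polynomial_model(model, data)
# invisible() just suppresses the printed list of return values
invisible(lapply(poly_models, plot_data_with_polynomial_model, data = FakePolyData))
```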

Evaluating the Models Using In-Sample Fit Metrics

Next let’s calculate some fit metrics, such as \(R^2\) and adjusted \(R^2\), for these nine models. The rsquared_adj() function is defined in helper_functions.R.
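A sketch of that calculation is below; the body of rsquared_adj() here is just the standard adjusted \(R^2\) formula, \(1 - (1 - R^2)\frac{n-1}{n-p-1}\), and may not match the helper exactly (fit_metrics is an illustrative table name):

```r
# Assumed version of rsquared_adj(); the real helper is in helper_functions.R
rsquared_adj <- function(model) {
  r2 <- summary(model)$r.squared
  n  <- length(residuals(model))
  p  <- length(coef(model)) - 1        # number of predictors = polynomial degree
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

fit_metrics <- data.frame(
  degree  = 1:9,
  rsq     = sapply(poly_models, function(m) summary(m)$r.squared),
  rsq_adj = sapply(poly_models, rsquared_adj)
)
fit_metrics
```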

Let’s see how regular and adjusted \(R^2\) compare:

Unadjusted \(R^2\) of course gets higher and higher as we add complexity. Adjusted \(R^2\) maxes out around polynomial degree 5 or 6 and then falls off.

Recall, though, that the true generating function was a third-order polynomial, so even the model favoured by adjusted \(R^2\) is overfitting to some extent.

Evaluating The Models Using Out of Sample Validation

We’re going to build up to cross-validation, but we’ll go one step at a time so you can see what we’re doing at each step.

The code below is broken into more steps than you’d really need – I’ll show some ‘all-in-one’ code at the end – but working through it piece by piece makes it easier to see what’s going on.

Simple Validation

Before we do cross-validation, let’s try just doing validation with a single training set and a single validation set.

First, we split the data randomly, taking 10% of the cases as the validation set (since this dataset has only 12 datapoints, the validation set will contain just one of them).
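A sketch of such a split (the seed and the object names train_data and val_data are illustrative):

```r
set.seed(1)                                     # illustrative seed
n_val      <- round(0.1 * nrow(FakePolyData))   # 10% of 12 rows = 1 row
val_rows   <- sample(nrow(FakePolyData), size = n_val)
train_data <- FakePolyData[-val_rows, ]
val_data   <- FakePolyData[val_rows, ]
```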

Now that we’ve split the data, let’s fit some models on the training set.
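For example, reusing the lapply() pattern from above (train_models is another illustrative name):

```r
# Refit the nine polynomial models using only the training rows
train_models <- lapply(1:9, fit_polynomial_model,
                       linear_formula = y ~ x, data = train_data)
```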

and plot them with the full dataset:

Can you guess which point was held out?

Let’s plot the data with just the validation point:

Now we’ll compute the mean squared prediction error for the held-out point for each of the models. The sapply() function below does the same thing as lapply(); it just “simplifies” the list of results into a format that fits in a column of a data table. We’re appending these generalization-error measures to the table of other fit metrics we calculated before.
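Something along these lines would do it (using the illustrative train_models, val_data, and fit_metrics objects from the sketches above):

```r
# Squared prediction error on the single held-out point, one value per model
fit_metrics$mse_validation <- sapply(train_models, function(m) {
  preds <- predict(m, newdata = val_data)
  mean((val_data$y - preds)^2)
})
```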

Here’s the mean squared (generalization) error for the nine models:

As we saw from the plots, the 9th order model is doing terribly.

The other values are squeezed into a narrow range at the bottom of that scale, so let’s plot the square root of the mean squared error instead.

So for this particular validation fold, the 7th order polynomial generalizes the best, though the 4th order gets very close.

Compare this to the mean squared prediction error on the training set itself:
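Again with the illustrative names from above, that comparison might be computed like so:

```r
# In-sample MSE: the same models, scored on the training data they were fit to
fit_metrics$mse_training <- sapply(train_models, function(m) {
  mean((train_data$y - predict(m, newdata = train_data))^2)
})
```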

Cross-Validation

The above is based on just one fixed validation point. What about cross-validation, where we let each point have a turn as the validation set?

The do_k_fold_cv() function will split the dataset you supply into K folds (for the value of K you supply as an argument), and do what we did above with each fold taking its turn as the validation set. It will calculate the error metric you supply for each of the validation folds, and return the average result.
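The real function is defined in helper_functions.R; the sketch below shows roughly what such a function might do, with an assumed signature that may not match the actual helper:

```r
# Rough sketch of k-fold cross-validation; argument names are assumptions
do_k_fold_cv <- function(data, k, fit_fn, error_fn) {
  folds <- sample(rep(1:k, length.out = nrow(data)))   # random fold assignment
  fold_errors <- sapply(1:k, function(i) {
    train <- data[folds != i, ]
    valid <- data[folds == i, ]
    model <- fit_fn(train)                              # fit on all but fold i
    error_fn(valid$y, predict(model, newdata = valid))  # score on fold i
  })
  mean(fold_errors)                                     # average over the k folds
}

# Example: each of the 12 points takes a turn as the validation set (k = 12),
# scoring the degree-5 model with mean squared error
mse <- function(actual, predicted) mean((actual - predicted)^2)
cv_mse_5 <- do_k_fold_cv(FakePolyData, k = nrow(FakePolyData),
                         fit_fn   = function(d) fit_polynomial_model(5, y ~ x, data = d),
                         error_fn = mse)
cv_mse_5          # cross-validated mean squared error
sqrt(cv_mse_5)    # and its square root
```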

Here are the results for the degree-5 model (the mean squared error, followed by its square root):

## [1] 63.4288
## [1] 7.96422

We can compute this for all of our different models and plot the results:
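Continuing with the assumed names from the sketch above, that might look like:

```r
# Cross-validated MSE for each polynomial degree, appended to the metrics table
fit_metrics$cv_mse <- sapply(1:9, function(deg) {
  do_k_fold_cv(FakePolyData, k = nrow(FakePolyData),
               fit_fn   = function(d) fit_polynomial_model(deg, y ~ x, data = d),
               error_fn = mse)
})
```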

Now let’s plot the resulting error measures.

It’s hard to see the differences among the smaller values because the errors for the most complex models are so much larger, so let’s just chop those models off the plot…

And lo and behold, the 3rd order polynomial does the best! (We may have gotten a bit lucky here; cross-validation won’t always pick out the “best” model by any means, but it’s about as good a method as we have.)

We don’t actually have a test set in this case, but if we did, we could evaluate the prediction error of the final model on the test set like this:
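A sketch of that final step (test_data is hypothetical here; no such set exists in this demo):

```r
# Hypothetical: score the chosen cubic model on a held-out test set, test_data
final_model <- fit_polynomial_model(3, linear_formula = y ~ x, data = FakePolyData)
test_preds  <- predict(final_model, newdata = test_data)
mean((test_data$y - test_preds)^2)   # test-set mean squared prediction error
```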