In this short analysis, we compare the prediction accuracy of several linear regression methods in the four simulation examples of Zou & Hastie (2005). We also include two additional scenarios, similar to Examples 1 and 2 from Zou & Hastie (2005): a “null” scenario, in which the predictors have no effect on the outcome; and a “one effect” scenario, in which only one of the predictors affects the outcome.

The six methods compared are: (1) ridge regression; (2) the Lasso; (3) the Elastic Net; (4) “Sum of Single Effects” (SuSiE) regression, described here; (5) variational inference for Bayesian variable selection, or “varbvs”, described here; and (6) “varbvsmix”, an elaboration of varbvs that replaces the single normal prior with a mixture-of-normals.

Load packages

Load a few packages and custom functions used in the analysis below.
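A minimal sketch of this step, under assumptions: dscrutils is named below as the source of “dscquery”, cowplot is assumed because “plot_grid” is called later, and ggplot2 is assumed as the plotting backend. The script that defines the custom functions (“compute.relative.mse”, “rmse.boxplot”) is not named in this analysis, so it appears only as a commented, hypothetical placeholder.

```r
# Packages assumed from the function calls later in this analysis.
pkgs <- c("dscrutils","ggplot2","cowplot")

# Check which of these packages are installed before attaching them.
installed <- vapply(pkgs,requireNamespace,logical(1),quietly = TRUE)
for (pkg in pkgs[installed])
  library(pkg,character.only = TRUE)
if (any(!installed))
  message("Not installed: ",paste(pkgs[!installed],collapse = ", "))

# The custom functions would be sourced here; this path is hypothetical.
# source("path/to/custom_functions.R")
```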


Import DSC results

Here we use the “dscquery” function from the dscrutils package to extract the DSC results we are interested in: the mean squared error in the predictions from each method in each simulation scenario.

methods <- c("ridge","lasso","elastic_net","susie","varbvs","varbvsmix")
dsc <- dscquery("../dsc/linreg",
                verbose = FALSE)
dsc <- transform(dsc,
                 simulate          = factor(simulate),
                 simulate.scenario = factor(simulate.scenario),
                 fit               = factor(fit,methods))
names(dsc)[1] <- "seed"

After this call, the “dsc” data frame should contain results for 7,200 pipelines—6 methods times 6 scenarios times 200 data sets simulated in each scenario.

nrow(dsc)
# [1] 7200
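The pipeline count can be reproduced without the DSC output by enumerating every combination of method, scenario and data set. This is only an illustrative sketch: the scenario labels and the assumption that seeds run from 1 to 200 are not taken from the DSC itself.

```r
methods <- c("ridge","lasso","elastic_net","susie","varbvs","varbvsmix")

# Illustrative labels for the six scenarios (assumed, not from the DSC).
scenarios <- c("null_effects","one_effect",paste("zh",1:4))

# One row per pipeline: 200 data sets x 6 scenarios x 6 methods.
pipelines <- expand.grid(seed     = 1:200,
                         scenario = scenarios,
                         fit      = methods,
                         stringsAsFactors = FALSE)
nrow(pipelines)
# [1] 7200

# Every scenario-method pair should contribute exactly 200 rows.
counts <- table(pipelines$scenario,pipelines$fit)
```

Applying the same kind of cross-tabulation to the real “dsc” data frame is a quick way to spot pipelines that failed or were not run.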

After these steps, the “dsc” data frame should have five columns: “seed”, the seed used to simulate the data; “simulate”, the data simulation module used; “simulate.scenario”, the particular scenario used in the “zh” simulation module; “fit”, the linear regression method used; and “mse.err”, the mean squared error in the test set predictions.

head(dsc)
#   seed     simulate   fit   mse.err simulate.scenario
# 1    1 null_effects ridge 12.322653              <NA>
# 2    2 null_effects ridge 12.695073              <NA>
# 3    3 null_effects ridge 10.625812              <NA>
# 4    4 null_effects ridge 10.267211              <NA>
# 5    5 null_effects ridge 10.429469              <NA>
# 6    6 null_effects ridge  9.940155              <NA>

Note that you will need to run the DSC before querying the results; see here for instructions on running the DSC. If you did not run the DSC to generate these results, you can replace the dscquery call above with this line, which loads the pre-extracted results stored in a CSV file:

dsc <- read.csv("../output/linreg_mse.csv")

This is how the CSV file was created:

write.csv(dsc,"../output/linreg_mse.csv",row.names = FALSE,quote = FALSE)

Summarize and discuss simulation results

Compute the mean squared error (MSE) in the predictions relative to ridge regression, so that larger numbers mean greater error relative to the predictions from ridge regression.

rmse <- compute.relative.mse(dsc)
dsc  <- cbind(dsc,rmse)
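The definition of “compute.relative.mse” is not shown in this analysis. As a rough sketch of one way such a calculation could work, the hypothetical helper below divides each MSE by the ridge regression MSE obtained on the same simulated data set, assuming that a data set is identified by its seed, simulation module and scenario. The function name and the toy numbers are made up for illustration.

```r
# Hypothetical helper; the analysis's own compute.relative.mse may differ.
relative_mse_sketch <- function (dat, baseline = "ridge") {

  # Identify each simulated data set by seed, module and scenario.
  key      <- paste(dat$seed,dat$simulate,dat$simulate.scenario)
  base     <- dat[dat$fit == baseline,]
  base_key <- paste(base$seed,base$simulate,base$simulate.scenario)

  # Look up the baseline MSE for the data set in each row, then divide.
  base_mse <- base$mse.err[match(key,base_key)]
  return(dat$mse.err / base_mse)
}

# Toy input with made-up errors for two data sets and two methods.
toy <- data.frame(seed              = c(1,1,2,2),
                  simulate          = "one_effect",
                  simulate.scenario = NA,
                  fit               = c("ridge","lasso","ridge","lasso"),
                  mse.err           = c(10,8,12,15))
toy$rmse <- relative_mse_sketch(toy)
print(toy)
```

In the toy input, the Lasso rows get relative MSEs of 0.8 and 1.25, and the ridge rows are exactly 1 by construction.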

The boxplots below summarize the prediction errors in each of the simulations. A relative MSE less than 1 indicates an improvement in accuracy over ridge regression, whereas a relative MSE greater than 1 indicates a decrease in accuracy compared to ridge regression. Ridge regression will always have a relative MSE of 1, so the results for ridge regression are not shown.

p1 <- rmse.boxplot(subset(dsc,simulate == "null_effects"),"null")
p2 <- rmse.boxplot(subset(dsc,simulate == "one_effect"),"one effect")
p3 <- rmse.boxplot(subset(dsc,simulate.scenario == 1),"scenario 1")
p4 <- rmse.boxplot(subset(dsc,simulate.scenario == 2),"scenario 2")
p5 <- rmse.boxplot(subset(dsc,simulate.scenario == 3),"scenario 3")
p6 <- rmse.boxplot(subset(dsc,simulate.scenario == 4),"scenario 4")
p  <- plot_grid(p1,p2,p3,p4,p5,p6)
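The “rmse.boxplot” function is a custom function whose definition is not shown here. The sketch below is a hypothetical stand-in written with ggplot2, assuming the input has a “fit” column and an “rmse” column; it drops the ridge regression rows and marks a relative MSE of 1 with a dotted line. The toy data are randomly generated and are not DSC results.

```r
library(ggplot2)

# Hypothetical stand-in for rmse.boxplot; the real function may differ.
rmse_boxplot_sketch <- function (dat, plot_title) {

  # Ridge regression always has a relative MSE of 1, so leave it out.
  dat <- subset(dat,fit != "ridge")
  ggplot(dat,aes(x = fit,y = rmse)) +
    geom_boxplot(width = 0.5,outlier.size = 1) +
    geom_hline(yintercept = 1,linetype = "dotted") +
    labs(x = "",y = "MSE relative to ridge",title = plot_title)
}

# Toy data (randomly generated, for illustration only).
set.seed(1)
toy <- data.frame(fit  = rep(c("ridge","lasso","elastic_net"),each = 20),
                  rmse = c(rep(1,20),exp(rnorm(40,sd = 0.1))))
p_toy <- rmse_boxplot_sketch(toy,"toy example")
```

Because the sketch returns a ggplot object, several such panels can be arranged with “plot_grid” in the same way as above.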

Here are a few initial impressions from these plots.

In most cases, the Elastic Net does at least as well as, or better than, the Lasso. This is what we would expect.

Ridge regression actually achieves good accuracy in all cases except Scenario 4 and the “one effect” setting. Ridge regression is expected to do less well in Scenario 4 because the majority of the true coefficients are zero, so a sparse model would be favoured. Similarly, a sparse model should better fit data simulated in the “one effect” scenario.

In Scenario 4, where the predictors are correlated in a structured way, and the effects are sparse, varbvs and varbvsmix perform considerably better than the other methods.

The “varbvsmix” method yields competitive predictions in all scenarios.
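Visual impressions like these can be backed up with a numerical summary, such as the median relative MSE for each scenario and method. The sketch below uses base R's tapply on a toy data frame; the column names mirror the ones used above, but the values are randomly generated and are not results from the DSC.

```r
# Toy stand-in for the "dsc" data frame (values are made up).
set.seed(1)
toy <- expand.grid(seed     = 1:5,
                   scenario = c("null","one effect"),
                   fit      = c("ridge","lasso","varbvs"),
                   stringsAsFactors = FALSE)
toy$rmse <- ifelse(toy$fit == "ridge",1,exp(rnorm(nrow(toy),sd = 0.2)))

# Median relative MSE: one row per scenario, one column per method.
med <- with(toy,tapply(rmse,list(scenario,fit),median))
print(round(med,3))
```

Medians below 1 in a given row would indicate methods that typically improve on ridge regression in that scenario.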

Session information

The “Session information” button below gives the version of R and the packages that were used to generate these results. This listing includes the R packages that were also used to run the DSC.

