Here we investigate "selective inference" in the toy example of Wang et al (2018). We show that the approach will sometimes select the wrong variables (which is inevitable when variables are perfectly correlated) and then assign them highly significant $p$ values. This happens because, even though the wrong variables are selected, their coefficients within the wrong model can still be estimated precisely.
knitr::opts_chunk$set(comment = "#",collapse = TRUE,results = "hold")
First, load the selective inference package.
library(selectiveInference)
Now simulate some data with $x_1 = x_2$ and $x_3 = x_4$, and with effects at variables 1 and 4. (We simulate $p = 100$ variables rather than $p = 1000$ so that the example runs faster.)
set.seed(15)
n <- 500
p <- 100
x <- matrix(rnorm(n*p),n,p)
x[,2] <- x[,1]
x[,4] <- x[,3]
b <- rep(0,p)
b[1] <- 1
b[4] <- 1
y <- drop(x %*% b + rnorm(n))
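Because columns 3 and 4 are identical at this point, placing the effect on variable 4 gives exactly the same response as placing it on variable 3, so the data alone cannot tell the two apart. The following base R sketch checks this. The names `b3`, `b4`, `e`, `y3` and `y4` are introduced here for illustration only, and a shared noise vector is used so that the two responses are directly comparable.

```r
set.seed(15)
n <- 500
p <- 100
x <- matrix(rnorm(n*p),n,p)
x[,2] <- x[,1]
x[,4] <- x[,3]

# Two coefficient vectors that differ only in which duplicate carries the effect.
b4 <- rep(0,p)
b4[c(1,4)] <- 1
b3 <- rep(0,p)
b3[c(1,3)] <- 1

# Use the same noise for both responses.
e  <- rnorm(n)
y4 <- drop(x %*% b4 + e)
y3 <- drop(x %*% b3 + e)

# TRUE if the two responses agree.
all.equal(y3,y4)
```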
Unfortunately, the selective inference methods won't allow duplicate columns.
try(fsfit <- fs(x,y))
try(larfit <- lar(x,y))
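The duplicated columns can also be seen without the package: they make the design matrix rank deficient. The sketch below uses base R only and rebuilds the same design matrix so that it runs on its own. It is not meant to reproduce whatever check `fs` and `lar` perform internally; the names `dup_cols` and `x_rank` are illustrative.

```r
set.seed(15)
n <- 500
p <- 100
x <- matrix(rnorm(n*p),n,p)
x[,2] <- x[,1]
x[,4] <- x[,3]

# Columns that exactly repeat an earlier column.
dup_cols <- which(duplicated(x,MARGIN = 2))

# Numerical rank of the design matrix; each exact duplicate lowers it by one.
x_rank <- qr(x)$rank
list(duplicates = dup_cols,rank = x_rank,columns = p)
```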
So we modify `x` so that the identical columns aren't quite identical.
x[,2] <- x[,1] + rnorm(n,0,0.1)
x[,4] <- x[,3] + rnorm(n,0,0.1)
cor(x[,1],x[,2])
cor(x[,3],x[,4])
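For reference, these correlations can be predicted. If $x_2 = x_1 + \varepsilon$, where $x_1$ has unit variance and $\varepsilon$ is independent noise with standard deviation 0.1, then $\mathrm{cor}(x_1, x_2) = 1/\sqrt{1 + 0.1^2} \approx 0.995$. A small standalone check follows; the variable names are illustrative and a fresh seed is used, so the sample value will not match the output above exactly.

```r
set.seed(1)
n  <- 500
x1 <- rnorm(n)
x2 <- x1 + rnorm(n,0,0.1)

# Population correlation implied by the noise level.
r_theory <- 1/sqrt(1 + 0.1^2)

# Sample correlation, which should be close to r_theory.
r_sample <- cor(x1,x2)
c(theory = r_theory,sample = r_sample)
```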
Now run the forward selection again, computing sequential p-values and confidence intervals.
fsfit <- fs(x,y)
out <- fsInf(fsfit)
print(out)
From the above output, we see that the selective inference method selected variables 1 and 3 with very small p-values. Of course, we know that variable 3 is a false selection, so it might seem bad that its p-value is small. However, these p-values do not measure the significance of the variable selection; they measure the significance of the coefficient of the selected variable, conditional on the selection event.
Put another way, selective inference is not trying to assess uncertainty in which variables should be selected, and is certainly not trying to produce inferences of the form $$(b_1 \neq 0 \text{ OR } b_2 \neq 0) \text{ AND } (b_3 \neq 0 \text{ OR } b_4 \neq 0),$$ which was the goal of Wang et al (2018).
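The point that a variable can be the "wrong" selection and still have a precisely estimated coefficient can be illustrated with ordinary least squares. The sketch below rebuilds the simulated data and fits two candidate models, one using variable 3 and one using variable 4. These are ordinary, not selection-adjusted, p-values, so they are not the quantities reported by `fsInf`; the names `fit3`, `fit4`, `pval3` and `pval4` are illustrative.

```r
set.seed(15)
n <- 500
p <- 100
x <- matrix(rnorm(n*p),n,p)
x[,2] <- x[,1]
x[,4] <- x[,3]
b <- rep(0,p)
b[1] <- 1
b[4] <- 1
y <- drop(x %*% b + rnorm(n))
x[,2] <- x[,1] + rnorm(n,0,0.1)
x[,4] <- x[,3] + rnorm(n,0,0.1)

# Ordinary least squares in two candidate models:
# one using variable 3, the other using variable 4.
fit3 <- lm(y ~ x[,1] + x[,3])
fit4 <- lm(y ~ x[,1] + x[,4])

# Row 3 of the coefficient table is the second predictor.
pval3 <- summary(fit3)$coefficients[3,"Pr(>|t|)"]
pval4 <- summary(fit4)$coefficients[3,"Pr(>|t|)"]
c(variable3 = pval3,variable4 = pval4)
```

Both p-values should be very small: within either model the coefficient is well determined, which is a separate question from which of the two nearly identical variables should have been selected.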
Exported from manuscript_results/selective_inference_toy.Rmd
committed by Gao Wang on Thu Jul 11 11:20:51 2019 revision 5, 52efa03