Selective inference for a toy example¶

Here we investigate "selective inference" in the toy example of Wang et al (2018). We show that the approach will sometimes select the wrong variables -- which is inevitable in cases where variables are perfectly correlated -- and then assign them highly significant $p$ values. This is because even though the wrong variables are selected, their coefficients within the wrong model can be estimated precisely.

knitr::opts_chunk$set(comment = "#",collapse = TRUE,results = "hold")

Load packages¶

First, load the selective inference package.

library(selectiveInference)

Loading required package: glmnet
Loading required package: Matrix
Loading required package: foreach
Loaded glmnet 2.0-18

Loading required package: intervals

Attaching package: ‘intervals’

The following object is masked from ‘package:Matrix’:

    expand

Loading required package: survival
Loading required package: adaptMCMC
Loading required package: parallel
Loading required package: coda
Loading required package: MASS

Simulate data¶

Now simulate some data with $x_1 = x_2$ and $x_3 = x_4$, and with effects at variables 1 and 3. (We simulate $p = 100$ variables rather than $p = 1000$ so that the example runs faster.)

set.seed(15)
n     <- 500
p     <- 100
x     <- matrix(rnorm(n*p),n,p)
x[,2] <- x[,1]
x[,4] <- x[,3]
b     <- rep(0,p)
b[1]  <- 1
b[4]  <- 1
y     <- drop(x %*% b + rnorm(n))

Run selective inference¶

Unfortunately, the selective inference methods won't allow duplicate columns.

try(fsfit <- fs(x,y))
try(larfit <- lar(x,y))

Error in checkargs.xy(x = x, y = y) : x cannot have duplicate columns
Error in checkargs.xy(x = x, y = y) : x cannot have duplicate columns

So we modify x so that the identical columns aren't quite identical.

x[,2] <- x[,1] + rnorm(n,0,0.1)
x[,4] <- x[,3] + rnorm(n,0,0.1)
cor(x[,1],x[,2])
cor(x[,3],x[,4])

Now run the forward selection again, computing sequential p-values and confidence intervals.

fsfit <- fs(x,y)
out   <- fsInf(fsfit)
print(out)

Call:
fsInf(obj = fsfit)

Standard deviation of noise (specified or estimated) sigma = 0.995

Sequential testing results with alpha = 0.100
 Step Var   Coef Z-score P-value LowConfPt UpConfPt LowTailArea UpTailArea
    1   1  0.900  20.573   0.003     0.441    0.953       0.050      0.049
    2   3  0.956  21.494   0.000     0.900    1.397       0.049      0.050
    3  37  0.099   2.133   0.539    -0.439    0.153       0.050      0.049
    4  45  0.083   1.817   0.911      -Inf    0.204       0.000      0.050
    5   4 -0.838  -1.850   0.215      -Inf   10.125       0.000      0.050
    6  46  0.077   1.747   0.715    -3.695    1.192       0.050      0.050
    7  64 -0.077  -1.718   0.409    -3.906    2.666       0.050      0.050
    8  79  0.076   1.678   0.542    -3.007    2.405       0.050      0.050
    9  43  0.080   1.721   0.366    -1.354    2.373       0.050      0.050
   10  68 -0.075  -1.669   0.456      -Inf      Inf       0.000      0.000
   11  13  0.081   1.738   0.294    -2.452      Inf       0.050      0.000
   12  61 -0.075  -1.588   0.254    -3.182    0.957       0.050      0.050
   13  56 -0.064  -1.452   0.293    -3.213    1.219       0.050      0.050
   14  84  0.058   1.332   0.934      -Inf    0.164       0.000      0.050
   15  90  0.060   1.324   0.528    -2.526    2.161       0.050      0.050
   16  65 -0.057  -1.284   0.183      -Inf    2.364       0.000      0.050
   17  69 -0.056  -1.243   0.407    -3.554    2.394       0.050      0.050
   18  74 -0.053  -1.177   0.998     4.469      Inf       0.010      0.000
   19  18  0.052   1.194   0.010     2.129      Inf       0.050      0.000
   20  70 -0.047  -1.097   0.047      -Inf   -0.130       0.000      0.050
   21  49 -0.049  -1.075   0.135      -Inf    2.211       0.000      0.050
   22  88  0.055   1.118   0.918      -Inf    1.226       0.000      0.050
   23  25  0.049   1.055   0.060    -0.614      Inf       0.050      0.000
   24  27 -0.047  -1.055   0.909    -0.467      Inf       0.050      0.000
   25  96  0.051   1.120   0.861      -Inf      Inf       0.000      0.000
   26  39  0.048   1.029   0.908      -Inf    3.127       0.000      0.050
   27  40 -0.050  -1.041   0.001      -Inf   -4.787       0.000      0.009
   28  17 -0.045  -0.968   0.713    -1.456    4.322       0.050      0.050
   29  66 -0.041  -0.901   0.942    -0.162      Inf       0.050      0.000
   30  85  0.042   0.916   0.508      -Inf      Inf       0.000      0.000
   31   8 -0.044  -0.932   0.609    -2.522    4.229       0.050      0.050
   32  22 -0.044  -0.933   0.155      -Inf    3.595       0.000      0.050
   33  78  0.041   0.933   0.265    -2.083      Inf       0.050      0.000
   34  28 -0.041  -0.882   0.789    -4.262      Inf       0.050      0.000
   35   7  0.043   0.896   0.914      -Inf    0.491       0.000      0.050
   36  26  0.044   0.905   0.620      -Inf      Inf       0.000      0.000
   37  21 -0.041  -0.869   0.585      -Inf      Inf       0.000      0.000
   38  71  0.043   0.940   0.181    -1.138      Inf       0.050      0.000
   39  73  0.040   0.858   0.312      -Inf      Inf       0.000      0.000
   40  50 -0.041  -0.867   0.963     4.773      Inf       0.048      0.000
   41  99  0.041   0.876   0.002     4.638      Inf       0.020      0.000
   42  59  0.036   0.790   0.544      -Inf    3.756       0.000      0.050
   43  76  0.034   0.738   0.432    -3.622      Inf       0.050      0.000
   44  62 -0.037  -0.781   0.325    -3.570    1.584       0.050      0.050
   45  19  0.031   0.717   0.794      -Inf    2.280       0.000      0.050
   46  36 -0.033  -0.711   0.557      -Inf      Inf       0.000      0.000
   47   6  0.030   0.684   0.910      -Inf      Inf       0.000      0.000
   48  34  0.034   0.725   0.036     1.862      Inf       0.050      0.000
   49  57 -0.035  -0.740   0.701    -2.243      Inf       0.050      0.000
   50  42 -0.034  -0.754   0.126      -Inf    0.621       0.000      0.050
   51  14 -0.032  -0.692   0.660    -1.971    4.268       0.050      0.050
   52  53  0.030   0.665   0.822      -Inf    1.874       0.000      0.050
   53  83 -0.028  -0.627   0.218      -Inf    2.913       0.000      0.050
   54  35  0.029   0.620   0.597    -4.158    2.645       0.050      0.050
   55  12  0.027   0.594   0.725      -Inf    3.415       0.000      0.050
   56   2 -0.285  -0.587   0.877      -Inf      Inf       0.000      0.000
   57  38  0.031   0.617   0.336      -Inf      Inf       0.000      0.000
   58  41 -0.026  -0.562   0.987     0.322      Inf       0.000      0.000
   59  51  0.027   0.583   0.023    -0.330      Inf       0.000      0.000
   60  91  0.030   0.644   0.943      -Inf    0.654       0.000      0.050
   61  58  0.027   0.574   0.992      -Inf   -2.291       0.000      0.050
   62  97  0.029   0.558   0.002     5.116      Inf       0.004      0.000
   63  87  0.025   0.523   0.811      -Inf      Inf       0.000      0.000
   64   5  0.025   0.509   0.108    -2.311      Inf       0.050      0.000
   65  63  0.022   0.485   0.655      -Inf    2.721       0.000      0.050
   66  47  0.022   0.455   0.957      -Inf   -0.337       0.000      0.044
   67   9  0.021   0.459   0.675      -Inf    4.098       0.000      0.050
   68  31  0.021   0.448   0.064    -1.444      Inf       0.050      0.000
   69  89 -0.021  -0.426   0.928    -0.992      Inf       0.050      0.000
   70  32 -0.022  -0.419   0.428      -Inf      Inf       0.000      0.000
   71  86 -0.021  -0.422   0.060      -Inf    0.371       0.000      0.000
   72  48 -0.020  -0.402   0.675    -2.214      Inf       0.050      0.000
   73  20 -0.017  -0.360   0.789    -4.532      Inf       0.050      0.000
   74  67 -0.018  -0.373   0.142      -Inf    2.074       0.000      0.050
   75  44  0.017   0.348   0.363    -2.758      Inf       0.050      0.000
   76  52  0.016   0.329   0.395    -3.551      Inf       0.050      0.000
   77  11  0.015   0.309   0.942      -Inf    0.331       0.000      0.050
   78  23  0.014   0.308   0.136      -Inf      Inf       0.000      0.000
   79  94  0.014   0.313   0.596      -Inf      Inf       0.000      0.000
   80  30  0.014   0.287   0.463      -Inf      Inf       0.000      0.000
   81  95  0.013   0.264   0.509      -Inf      Inf       0.000      0.000
   82  33  0.012   0.251   0.553      -Inf      Inf       0.000      0.000
   83  15 -0.011  -0.226   0.790      -Inf      Inf       0.000      0.000
   84  98  0.011   0.235   0.092    -1.107      Inf       0.050      0.000
   85  72 -0.010  -0.205   0.496    -3.210    3.169       0.050      0.050
   86 100  0.008   0.161   0.889      -Inf    1.150       0.000      0.050
   87  81 -0.008  -0.161   0.157      -Inf    2.318       0.000      0.050
   88  24  0.007   0.135   0.824      -Inf    1.919       0.000      0.050
   89  10  0.006   0.123   0.611      -Inf      Inf       0.000      0.000
   90  60  0.006   0.120   0.376      -Inf      Inf       0.000      0.000
   91  75 -0.005  -0.113   0.386      -Inf      Inf       0.000      0.000
   92  16 -0.004  -0.086   0.642      -Inf      Inf       0.000      0.000
   93  54  0.004   0.078   0.999      -Inf   -5.340       0.000      0.002
   94  29  0.004   0.077   0.905      -Inf      Inf       0.000      0.000
   95  82  0.004   0.078   0.000     4.839      Inf       0.001      0.000
   96  80  0.000  -0.008   0.995     4.626      Inf       0.032      0.000
   97  93  0.000   0.008   0.191      -Inf      Inf       0.000      0.000
   98  77  0.000  -0.007   0.300      -Inf      Inf       0.000      0.000
   99  55  0.000  -0.004   0.716      -Inf      Inf       0.000      0.000
  100  92  0.000  -0.003   0.230      -Inf      Inf       0.000      0.000

Estimated stopping point from ForwardStop rule = 2

Summary¶

From the above output, we see that the selective inference method selected variables 1 and 3 with very small p-values. Of course, we know that variable 3 is a false selection, so it might seem bad that the p-value is small. However, you have to remember that p-values do not measure significance of variable selection---they measure the significance of the coefficient of the selected variable, conditional on the selection event.

Put another way, selective inference is not trying to assess uncertainty in which variables should be selected, and is certainly not trying to produce inferences of the form $$(b_1 \neq 0 \text{ OR } b_2 \neq 0) \text{ AND } (b_3 \neq 0 \text{ OR } b_4 \neq 0),$$ which was the goal of Wang et al (2018).