SuSiE manuscript

Experiment with variables of given high correlation structure

This notebook addresses a shared concern from two referees. The motivating example in the manuscript was designed as a simple toy to illustrate the novel type of inference SuSiE offers. Here are some slightly more complicated examples, based on the motivating example, but with variables that are highly (rather than perfectly) correlated with each other.

$x_1$ and $x_2$ are highly correlated

Following a reviewer's suggestion, we simulated two variables, $x_1$ and $x_2$, with high but not perfect correlation ($0.9$). Specifically, we simulated $n = 600$ samples stored as an $X_{600 \times 2}$ matrix, in which each row was drawn i.i.d. from a bivariate normal distribution with mean zero and $\mathrm{cor}(x_1, x_2) = 0.9$.

We then simulated $y_i = x_{i1} \beta_1 + x_{i2} \beta_2 + \varepsilon_i$, with $\beta_1 = 1, \beta_2 = 0$, and $\varepsilon_i$ i.i.d. normal with zero mean and standard deviation of 3. We performed 1,000 replicates of this simulation (generated with different random number seeds).
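The data-generating step just described can be sketched in R as follows (a sketch using `MASS::mvrnorm`; the original simulation script may differ in details such as seeding):

```r
library(MASS)  # for mvrnorm

set.seed(1)
n     <- 600
Sigma <- matrix(c(1.0, 0.9,
                  0.9, 1.0), nrow = 2)       # unit variances, cor(x1, x2) = 0.9
X     <- mvrnorm(n, mu = c(0, 0), Sigma = Sigma)
beta  <- c(1, 0)                             # beta_1 = 1, beta_2 = 0
y     <- drop(X %*% beta + rnorm(n, sd = 3)) # i.i.d. N(0, 3^2) errors
```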

In this simulation, the correlation between $x_1$ and $x_2$ is still sufficiently high (0.9) that distinguishing between the two variables is somewhat possible, but not entirely straightforward. For example, when we ran the lasso (using cv.glmnet from the glmnet R package) on these data, it wrongly selected $x_2$ as having a non-zero coefficient in about 10% of the simulations (95 out of 1,000), and correctly selected $x_1$ in about 96% of simulations (956 out of 1,000). Note that the lasso does not assess uncertainty in variable selection, so these results are not directly comparable with the SuSiE CSs below; however, they demonstrate that distinguishing the correct variable here is possible, but not so easy that the example is uninteresting.
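The lasso check can be sketched as follows (re-simulating one data set as above; here "selected" means a non-zero coefficient at the cross-validated `lambda.min`):

```r
library(MASS)    # mvrnorm
library(glmnet)  # cv.glmnet

set.seed(1)
n <- 600
X <- mvrnorm(n, mu = c(0, 0),
             Sigma = matrix(c(1, 0.9, 0.9, 1), nrow = 2))
y <- drop(X %*% c(1, 0) + rnorm(n, sd = 3))

cvfit    <- cv.glmnet(X, y)                           # lasso (alpha = 1) by default
b        <- as.numeric(coef(cvfit, s = "lambda.min"))[-1]  # drop the intercept
selected <- which(b != 0)                             # indices of selected variables
```

Repeating this over many seeds and tabulating `selected` reproduces the kind of selection frequencies reported above.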

Ideally, then, SuSiE should identify variable $x_1$ as an effect variable and drop $x_2$ as often as possible. However, due to the high correlation between the variables, it is inevitable that some 95% SuSiE credible sets (CSs) will also contain $x_2$. Most importantly, we should avoid, as much as possible, reporting a CS that contains only $x_2$, since the goal is that 95% of CSs should contain at least one effect variable. The SuSiE results (SuSiE version 0.9.1 on R 3.5.2) are summarized below. The code used for the simulation can be found here.

| CSs | count |
| :-- | --: |
| (1) | 829 |
| (1,2) | 169 |
| **(2)** | 2 |

Highlighted in bold are CSs that do not contain the true effect variable --- there are 2 of them out of 1,000 CSs detected. In summary, SuSiE precisely identifies the effect variable in a single CS in the majority (83%) of the simulations, and provides a "valid" CS (i.e., one containing an effect variable) in almost all simulations (998 out of 1,000). Further, even when SuSiE reports a CS including both variables, it consistently assigns higher posterior inclusion probability (PIP) to the correct variable, $x_1$: among the 169 CSs that contain both variables, the median PIPs for $x_1$ and $x_2$ were 0.86 and 0.14, respectively.
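For reference, one replicate of this analysis can be sketched in R (a sketch assuming the susieR interface; `susie_get_pip` and the `fit$sets$cs` field are from susieR, though defaults may differ slightly across versions):

```r
library(MASS)    # mvrnorm
library(susieR)  # susie, susie_get_pip

set.seed(1)
n <- 600
X <- mvrnorm(n, mu = c(0, 0),
             Sigma = matrix(c(1, 0.9, 0.9, 1), nrow = 2))
y <- drop(X %*% c(1, 0) + rnorm(n, sd = 3))

fit <- susie(X, y)          # 95% credible sets by default
cs  <- fit$sets$cs          # list of CSs (variable indices), or NULL if none
pip <- susie_get_pip(fit)   # one posterior inclusion probability per variable
```

Tabulating `cs` across 1,000 replicates yields a table of the form shown above.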

When an additional non-effect variable is highly correlated with both variable groups

Another referee suggested the following:

Suppose we have another predictor $x_5$, which is both correlated with $(x_1, x_2)$ and $(x_3, x_4)$. Say $\mathrm{cor}(x_1, x_5) = 0.9$, $\mathrm{cor}(x_2, x_5) = 0.7$, and $\mathrm{cor}(x_5, x_3) = \mathrm{cor}(x_5, x_4) = 0.8$. Does the current method assign $x_5$ to the $(x_1, x_2)$ group or the $(x_3, x_4)$ group?

Following the suggestion, we simulated $x_1, \ldots, x_5$ from a multivariate normal distribution with zero mean and covariance matrix approximately as given in the table below. (Since this matrix is not quite positive definite, in our R code we used nearPD from the Matrix package to generate the nearest positive definite matrix; the entries of the resulting covariance matrix differ only very slightly from those in the table below, with a maximum absolute difference of 0.0025 between corresponding elements of the two matrices.)

|  | $x_1$ | $x_2$ | $x_3$ | $x_4$ | $x_5$ |
| :-- | --: | --: | --: | --: | --: |
| $x_1$ | 1.00 | 0.92 | 0.70 | 0.70 | 0.90 |
| $x_2$ | 0.92 | 1.00 | 0.70 | 0.70 | 0.70 |
| $x_3$ | 0.70 | 0.70 | 1.00 | 0.92 | 0.80 |
| $x_4$ | 0.70 | 0.70 | 0.92 | 1.00 | 0.80 |
| $x_5$ | 0.90 | 0.70 | 0.80 | 0.80 | 1.00 |
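The positive-definite adjustment can be reproduced with `Matrix::nearPD` (a sketch; `corr = TRUE` asks for the nearest correlation matrix, preserving the unit diagonal):

```r
library(Matrix)  # nearPD

# Target correlation matrix from the table above (not quite positive definite)
S <- matrix(c(1.00, 0.92, 0.70, 0.70, 0.90,
              0.92, 1.00, 0.70, 0.70, 0.70,
              0.70, 0.70, 1.00, 0.92, 0.80,
              0.70, 0.70, 0.92, 1.00, 0.80,
              0.90, 0.70, 0.80, 0.80, 1.00), nrow = 5)

S_pd <- as.matrix(nearPD(S, corr = TRUE)$mat)  # nearest correlation matrix
max(abs(S_pd - S))                              # entries change only very slightly
```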

We simulated $n = 600$ samples from this multivariate normal distribution, then simulated $n = 600$ responses $y_i$ from the regression model $y_i = x_{i1} \beta_1 + \cdots + x_{i5} \beta_5 + \varepsilon_i$, with $\beta = (0, 1, 1, 0, 0)^T$ and $\varepsilon_i$ i.i.d. normal with zero mean and standard deviation 3. We repeated this simulation procedure 1,000 times with different random seeds, and each time we fit a SuSiE model to the simulated data by running the IBSS algorithm. To simplify the example, we ran the IBSS algorithm with $L = 2$ and fixed $\sigma_0^2 = 1$. Similar results were obtained when we used larger values of $L$, and when $\sigma_0^2$ was estimated. For more details on how the data were simulated and how the SuSiE models were fitted, see this script.
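One replicate of this second simulation can be sketched as follows (a sketch; in susieR the prior variance is specified on the standardized scale via `scaled_prior_variance`, which we use here as a stand-in for fixing $\sigma_0^2 = 1$ — the original script may set it differently):

```r
library(MASS)    # mvrnorm
library(Matrix)  # nearPD
library(susieR)  # susie

set.seed(1)
S <- matrix(c(1.00, 0.92, 0.70, 0.70, 0.90,
              0.92, 1.00, 0.70, 0.70, 0.70,
              0.70, 0.70, 1.00, 0.92, 0.80,
              0.70, 0.70, 0.92, 1.00, 0.80,
              0.90, 0.70, 0.80, 0.80, 1.00), nrow = 5)
S_pd <- as.matrix(nearPD(S, corr = TRUE)$mat)   # nearest positive definite matrix

n    <- 600
X    <- mvrnorm(n, mu = rep(0, 5), Sigma = S_pd)
beta <- c(0, 1, 1, 0, 0)                        # x2 and x3 are the effect variables
y    <- drop(X %*% beta + rnorm(n, sd = 3))

fit <- susie(X, y, L = 2,
             scaled_prior_variance = 1,         # hypothetical stand-in for sigma_0^2 = 1
             estimate_prior_variance = FALSE)   # keep the prior variance fixed
fit$sets$cs                                     # list of credible sets (variable indices)
```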

Like the toy motivating example given in the paper, in this simulation the first two predictors are strongly correlated with each other, so it may be difficult to distinguish between them, and likewise for the third and fourth predictors. The fifth predictor, which has no effect on $y$, potentially complicates matters because it is also strongly correlated with the other predictors. Despite this complication, our basic goal remains the same: the credible sets inferred by SuSiE should capture the true effects most of the time, while also minimizing "false positive" CSs that do not contain any true effects. (Further, each CS should, ideally, be as small as possible.)

Table below summarizes the results of these simulations: the left-hand column gives a unique result (a combination of CSs), and the right-hand column gives the number of times this unique result occurred among the 1,000 simulations. The CS combinations are ordered by the frequency of their occurrence in the simulations. We highlight in bold CSs that do not contain a true effect.

| CSs | count |
| :-- | --: |
| (2), (3) | 551 |
| (2), (3,4) | 212 |
| (1,2), (3) | 176 |
| (1,2), (3,4) | 38 |
| (2), (3,4,5) | 9 |
| **(1)**, (3,4) | 3 |
| (2), **(4)** | 3 |
| (1,2), (3,4,5) | 2 |
| **(1)**, (3) | 1 |
| (1,2), **(4)** | 1 |
| (2), (3,5) | 1 |
| (3), (1,2,5) | 1 |
| (3), (1,2,3) | 1 |
| (3,4), (1,2,4) | 1 |

In the majority (551) of the simulations, SuSiE precisely identified the true effect variables, and no others. In most other cases, SuSiE identified two CSs, each containing a correct effect variable, with one or more other variables included due to high correlation with the true effect variable. The referee asked specifically about how the additional variable $x_5$ behaves in this example. In practice, $x_5$ was rarely included in a CS. In the few cases where it was, the results were consistent with the simulation setting: $x_5$ was included more frequently with $x_3$ and/or $x_4$ (12 times) than with $x_1$ and/or $x_2$ (only once). In no simulation did SuSiE form a large group containing all five predictors.

This example actually highlights the benefits of SuSiE compared to alternative approaches (e.g., hierinf) that first cluster the variables into groups based on the correlation structure, then test the groups. As we pointed out in the manuscript, this alternative approach would work well in the toy example in the paper, but in general it requires ad hoc decisions about how to cluster variables. In this more complex example raised by the referee, it is far from clear how to cluster the variables. SuSiE avoids this problem because there is no pre-clustering of variables; instead, the SuSiE CSs are computed directly from an (approximate) posterior distribution (which takes into account how the variables $x$ are correlated with each other, as well as their relationship with $y$).


© 2017-2018 authored by Gao Wang at Stephens Lab, The University of Chicago

Exported from manuscript_results/motivating_example_high_corr.ipynb committed by Gao Wang on Fri Feb 7 13:49:00 2020 revision 3, 3708092