This document uses a toy example to illustrate the unique type of inference SuSiE is designed to perform. For a more complicated example, where variables are in high but not perfect correlation, please see this notebook.
We assume our audience is familiar with, or interested in, large-scale regression. Like LASSO, for example, SuSiE is a method for variable selection in large-scale regression. Yet the type of inference SuSiE attempts to accomplish differs from that of other large-scale regression methods, and is worth motivating in this document for those not familiar with this particular type of inference.
This type of inference is motivated by so-called "genetic fine-mapping" studies. Consider fitting the regression model $$y = \sum_{j=1}^p x_j\beta_j + \epsilon, \quad \epsilon \sim N(0, \sigma^2I_n),$$ where $x_1 = x_2$, $x_3 = x_4$, and $\beta_1 \ne 0, \beta_4 \ne 0, \beta_{j \notin \{1,4\}} = 0$. Our goal is to make a statement that $$\beta_1 \ne 0 \text{ or } \beta_2 \ne 0, \text{ and } \beta_3 \ne 0 \text{ or } \beta_4 \ne 0,$$
that is, in SuSiE variable selection we want to capture the fact that variables $x_1$ and $x_2$ (likewise $x_3$ and $x_4$) are too similar to distinguish. This is an inference that, to our knowledge, few other large-scale regression methods can make, at least not in an intuitive and interpretable fashion.
To illustrate, here I set $n=500$, $p=1000$ and simulate a data set:
set.seed(1)
n = 500
p = 1000
b = rep(0,p)
b[200] = 1
b[800] = 1
X = matrix(rnorm(n*p),nrow=n,ncol=p)
X[,200] = X[,400]
X[,600] = X[,800]
y = X %*% b + rnorm(n)
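As a quick sanity check (not part of the original analysis), we can confirm that the duplicated columns are exact copies and see where the true effects lie:
# columns 200/400 and 600/800 are exact copies; true effects are at columns 200 and 800
stopifnot(all(X[,200] == X[,400]), all(X[,600] == X[,800]))
which(b != 0)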
In the notation above, columns 200 and 400 play the roles of $x_1$ and $x_2$, and columns 600 and 800 those of $x_3$ and $x_4$; the non-zero effects are at columns 200 and 800. The "true" effects are:
pdf('truth.pdf', width =5, height = 5, pointsize=16)
plot(b, col="black", pch=16, main = 'True effect size')
pos = 1:length(b)
points(pos[b!=0],b[b!=0],col=2,pch=16)
dev.off()
%preview truth.pdf -s png --dpi 100
LASSO will simply choose one of ($x_1$ and $x_3$), ($x_1$ and $x_4$), ($x_2$ and $x_3$), or ($x_2$ and $x_4$). This is not what we want. For instance, on the simulated data,
alpha = 1
y.fit = glmnet::glmnet(X,y,alpha = alpha,intercept = FALSE)
y.cv = glmnet::cv.glmnet(X,y,alpha = alpha,intercept = FALSE,
                         lambda = y.fit$lambda)
bhat = glmnet::predict.glmnet(y.fit,type ="coefficients",
                              s = y.cv$lambda.min)[-1,1]
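Before plotting, a quick check (not in the original notebook) of which coefficients are non-zero makes the point concrete; with exactly duplicated columns, glmnet assigns each effect to only one column of the pair:
# Non-zero LASSO coefficients: expect one column from each identical pair
# (plus possibly a few small noise effects retained at lambda.min).
sort(which(bhat != 0))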
pdf('lasso.pdf', width =5, height = 5, pointsize=16)
plot(bhat, col="black", pch=16, main = 'Lasso')
pos = 1:length(bhat)
points(pos[b!=0],bhat[b!=0],col=2,pch=16)
dev.off()
%preview lasso.pdf -s png --dpi 100
LASSO arbitrarily picked $x_1$ and $x_3$ (columns 200 and 600) even though the true non-zero effects are $\beta_1$ and $\beta_4$ (columns 200 and 800).
For sparse enough problems, most other Bayesian methods provide posterior probabilities on "models" by enumerating (CAVIAR), stochastic search (FINEMAP), or sampling (BIMBAM) over all possible combinations of variables. In our motivating example, the "true" posterior over models is \begin{align} p(\beta_1 \ne 0, \beta_2 = 0, \beta_3 \ne 0, \beta_4 = 0, \beta_{5:p} = 0) &= 0.25 \\ p(\beta_1 = 0, \beta_2 \ne 0, \beta_3 \ne 0, \beta_4 = 0, \beta_{5:p} = 0) &= 0.25 \\ p(\beta_1 \ne 0, \beta_2 = 0, \beta_3 = 0, \beta_4 \ne 0, \beta_{5:p} = 0) &= 0.25 \\ p(\beta_1 = 0, \beta_2 \ne 0, \beta_3 = 0, \beta_4 \ne 0, \beta_{5:p} = 0) &= 0.25. \end{align} Although, compared to LASSO, this model posterior contains the information needed to make statements about $\beta_1 \ne 0 \text{ or } \beta_2 \ne 0, \text{ and } \beta_3 \ne 0 \text{ or } \beta_4 \ne 0$, that information is not explicitly provided. One has to post-process these posteriors in order to make the statement. Such post-processing is non-trivial; see e.g. Stephens and Balding 2009 for an example in the context of genetic fine-mapping.
In addition, these Bayesian methods often report marginal posterior inclusion probabilities, calculated as \begin{align} p(\beta_1 \ne 0) &= p(\beta_1 \ne 0, \beta_2 = 0, \beta_3 \ne 0, \beta_4 = 0, \beta_{5:p} = 0) \\ & + p(\beta_1 \ne 0, \beta_2 = 0, \beta_3 = 0, \beta_4 \ne 0, \beta_{5:p} = 0) \\ &= 0.5, \end{align} but this does not provide the type of inference we are interested in.
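To make this concrete, here is a minimal sketch of that post-processing in R, using the four model configurations listed above (each with probability 0.25) rather than the output of any particular software:
models = list(c(1,3), c(2,3), c(1,4), c(2,4))   # configurations with non-zero effects
model_prob = rep(0.25, 4)                       # their posterior probabilities
# Marginal inclusion probability for variable j: sum over models containing j.
pip = sapply(1:4, function(j) sum(model_prob[sapply(models, function(m) j %in% m)]))
pip                # 0.5 for each of variables 1-4
# The statement we actually want: probability that at least one variable in a
# pair has a non-zero effect.
p_either = function(pair) sum(model_prob[sapply(models, function(m) any(pair %in% m))])
p_either(c(1,2))   # P(beta_1 != 0 or beta_2 != 0) = 1
p_either(c(3,4))   # P(beta_3 != 0 or beta_4 != 0) = 1
Here the post-processing is trivial because only four models have non-zero probability; with many correlated variables it is far less obvious which groups of variables to query, which is exactly the kind of statement SuSiE is designed to produce directly.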
We illustrate Bayesian variable selection using FINEMAP. First we compute the "summary statistics" that FINEMAP requires:
library(abind)
mm_regression = function(X, Y, Z=NULL) {
  if (!is.null(Z)) {
    Z = as.matrix(Z)
  }
  reg = lapply(seq_len(ncol(Y)), function (i) simplify2array(susieR:::univariate_regression(X, Y[,i], Z)))
  reg = do.call(abind, c(reg, list(along=0)))
  # return array: out[1,,] is betahat, out[2,,] is shat
  return(aperm(reg, c(3,2,1)))
}
sumstats = mm_regression(as.matrix(X), as.matrix(y))
dat = list(X=X,Y=as.matrix(y))
saveRDS(list(data=dat, sumstats=sumstats), '/tmp/Toy.with_sumstats.rds')
We then add the directory containing the finemap executable to the search path (here ~/GIT/github/mvarbvs/dsc/modules/linux), and run a wrapper script provided in the susieR package under the inst/code folder (the package source can be downloaded here).
export PATH=~/GIT/github/mvarbvs/dsc/modules/linux:$PATH
Rscript ~/GIT/susieR/inst/code/finemap.R input=\"/tmp/Toy.with_sumstats.rds\" output=\"N2finemapping.FINEMAP\" args=\"--n-causal-max\ 2\" 2> /dev/null
Posterior probabilities for "models" are:
finemap = readRDS("N2finemapping.FINEMAP.rds")[[1]]
head(finemap$set)
FINEMAP correctly identified the 4 expected "model configurations", each with probability 0.25. Simply marginalizing across all models, we obtain a posterior inclusion probability (PIP) for each feature:
snp = finemap$snp
pip = snp[order(as.numeric(snp$snp)),]$snp_prob
pdf('finemap.pdf', width =5, height = 5, pointsize=16)
susieR::susie_plot(pip, y='PIP', b=b, main = 'Bayesian sparse regression')
dev.off()
%preview finemap.pdf -s png --dpi 100
From these marginalized posteriors we can only state that each of the 4 identified variables has probability 0.5 of being non-zero; that is, they all contribute equally. This does not reflect the fact that the result comes from the identical feature pairs $x_1, x_2$ and $x_3, x_4$.
Existing variational inference methods (e.g. varbvs) do not provide accurate calculations of marginal posterior probabilities, although, like LASSO, they are good for prediction purposes.
SuSiE uses a variational inference algorithm that is not only computationally convenient, but also directly yields posterior statements of the desired form.
fitted = susieR::susie(X, y, L=5,
                       estimate_residual_variance=TRUE,
                       scaled_prior_variance=0.2,
                       tol=1e-3, track_fit=TRUE, min_abs_corr=0.1)
pdf('susie.pdf', width =5, height = 5, pointsize=16)
susieR::susie_plot(fitted, y='PIP', b=b, max_cs=0.4, main = paste('SuSiE, ', length(fitted$sets$cs), 'CS identified'))
dev.off()
%preview susie.pdf -s png --dpi 100
Here SuSiE identifies 2 sets of variables: 2 confidence sets (CS), each containing one causal variable (blue and green circles). The marginal posteriors are the same as those from existing Bayesian methods (the FINEMAP result), but with the two sets identified one can easily state that the causal variables are ($x_1$ or $x_2$), and ($x_3$ or $x_4$). Within each set the variables contribute equally, because there is no information to distinguish between them.
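We can inspect the two confidence sets directly (a quick check using the sets and pip elements of the susieR fit, as in the plotting call above):
# Variables in each confidence set: expect {200, 400} and {600, 800}.
fitted$sets$cs
# Marginal PIPs for the two identical pairs: roughly 0.5 each.
fitted$pip[c(200, 400, 600, 800)]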
Notice that SuSiE does not directly provide a posterior over model configurations, that is, inference on ($x_1$ and $x_3$), or ($x_1$ and $x_4$), or ($x_2$ and $x_3$), or ($x_2$ and $x_4$).
The effect size estimates from SuSiE:
bhat = coef(fitted)[-1]
pdf('susie_eff.pdf', width =5, height = 5, pointsize=16)
susieR::susie_plot(bhat, y='bhat', b=b, main = 'SuSiE, effect size estimate')
dev.off()
%preview susie_eff.pdf -s png --dpi 100
convert -density 120 \( truth.pdf lasso.pdf +append \) \( finemap.pdf susie.pdf +append \) -append Motivating_Example.png