Sum of Single Effects (SuSiE) Regression using Summary Statistics

susie_rss performs variable selection under a sparse Bayesian multiple linear regression of $Y$ on $X$ using the z-scores from standard univariate regression of $Y$ on each column of $X$, an estimate, $R$, of the correlation matrix for the columns of $X$, and optionally, but strongly recommended, the sample size n. See “Details” for other ways to call susie_rss

susie_rss(
  z,
  R,
  n,
  bhat,
  shat,
  var_y,
  z_ld_weight = 0,
  estimate_residual_variance = FALSE,
  prior_variance = 50,
  check_prior = TRUE,
  ...
)

Arguments

z: p-vector of z-scores.
R: p x p correlation matrix.
n: The sample size.
bhat: Alternative summary data giving the estimated effects (a vector of length p). This, together with shat, may be provided instead of z.
shat: Alternative summary data giving the standard errors of the estimated effects (a vector of length p). This, together with bhat, may be provided instead of z.
var_y: The sample variance of y, defined as $y'y/(n-1)$. When the sample variance is not provided, the coefficients (returned from coef) are computed on the “standardized” X, y scale.
z_ld_weight: This parameter is included for backwards compatibility with previous versions of the function, but it is no longer recommended to set this to a non-zero value. When z_ld_weight > 0, the matrix R is adjusted to be cov2cor((1-w)*R + w*tcrossprod(z)), where w = z_ld_weight.
estimate_residual_variance: The default is FALSE, the residual variance is fixed to 1 or variance of y. If the in-sample LD matrix is provided, we recommend setting estimate_residual_variance = TRUE.
prior_variance: The prior variance(s) for the non-zero noncentrality parameterss $\tilde{b}_l$. It is either a scalar, or a vector of length L. When the susie_suff_stat option estimate_prior_variance is set to TRUE (which is highly recommended) this simply provides an initial value for the prior variance. The default value of 50 is simply intended to be a large initial value. Note this setting is only relevant when n is unknown. If n is known, the relevant option is scaled_prior_variance in susie_suff_stat.
check_prior: When check_prior = TRUE, it checks if the estimated prior variance becomes unreasonably large (comparing with 100 * max(abs(z))^2).
...: Other parameters to be passed to susie_suff_stat.

Value

A "susie" object with the following elements:

alpha: An L by p matrix of posterior inclusion probabilites.
mu: An L by p matrix of posterior means, conditional on inclusion.
mu2: An L by p matrix of posterior second moments, conditional on inclusion.
lbf: log-Bayes Factor for each single effect.
lbf_variable: log-Bayes Factor for each variable and single effect.
V: Prior variance of the non-zero elements of b.
elbo: The value of the variational lower bound, or “ELBO” (objective function to be maximized), achieved at each iteration of the IBSS fitting procedure.
sets: Credible sets estimated from model fit; see susie_get_cs for details.
pip: A vector of length p giving the (marginal) posterior inclusion probabilities for all p covariates.
niter: Number of IBSS iterations that were performed.
converged: TRUE or FALSE indicating whether the IBSS converged to a solution within the chosen tolerance level.

Details

In some applications, particularly genetic applications, it is desired to fit a regression model ($Y = Xb + E$ say, which we refer to as "the original regression model" or ORM) without access to the actual values of $Y$ and $X$, but given only some summary statistics. susie_rss assumes availability of z-scores from standard univariate regression of $Y$ on each column of $X$, and an estimate, $R$, of the correlation matrix for the columns of $X$ (in genetic applications $R$ is sometimes called the “LD matrix”).

With the inputs z, R and sample size n, susie_rss computes PVE-adjusted z-scores z_tilde, and calls susie_suff_stat with XtX = (n-1)R, Xty = $\sqrt{n-1} z_tilde$, yty = n-1, n = n. The output effect estimates are on the scale of $b$ in the ORM with standardized $X$ and $y$. When the LD matrix R and the z-scores z are computed using the same matrix $X$, the results from susie_rss are same as, or very similar to, susie with standardized $X$ and $y$.

Alternatively, if the user provides n, bhat (the univariate OLS estimates from regressing $y$ on each column of $X$), shat (the standard errors from these OLS regressions), the in-sample correlation matrix $R = cov2cor(crossprod(X))$, and the variance of $y$, the results from susie_rss are same as susie with $X$ and $y$. The effect estimates are on the same scale as the coefficients $b$ in the ORM with $X$ and $y$.

In rare cases in which the sample size, $n$, is unknown, susie_rss calls susie_suff_stat with XtX = R and Xty = z, and with residual_variance = 1. The underlying assumption of performing the analysis in this way is that the sample size is large (i.e., infinity), and/or the effects are small. More formally, this combines the log-likelihood for the noncentrality parameters, $\tilde{b} = \sqrt{n} b$, $$L(\tilde{b}; z, R) = -(\tilde{b}'R\tilde{b} - 2z'\tilde{b})/2,$$ with the “susie prior” on $\tilde{b}$; see susie and Wang et al (2020) for details. In this case, the effect estimates returned by susie_rss are on the noncentrality parameter scale.

The estimate_residual_variance setting is FALSE by default, which is recommended when the LD matrix is estimated from a reference panel. When the LD matrix R and the summary statistics z (or bhat, shat) are computed using the same matrix $X$, we recommend setting estimate_residual_variance = TRUE.

References

G. Wang, A. Sarkar, P. Carbonetto and M. Stephens (2020). A simple new approach to variable selection in regression, with application to genetic fine-mapping. Journal of the Royal Statistical Society, Series B 82, 1273-1300 https://doi.org/10.1101/501114.

Y. Zou, P. Carbonetto, G. Wang and M. Stephens (2021). Fine-mapping from summary data with the “Sum of Single Effects” model. bioRxiv https://doi.org/10.1101/2021.11.03.467167.

Examples

set.seed(1)
n = 1000
p = 1000
beta = rep(0,p)
beta[1:4] = 1
X = matrix(rnorm(n*p),nrow = n,ncol = p)
X = scale(X,center = TRUE,scale = TRUE)
y = drop(X %*% beta + rnorm(n))

input_ss = compute_suff_stat(X,y,standardize = TRUE)
ss   = univariate_regression(X,y)
R    = with(input_ss,cov2cor(XtX))
zhat = with(ss,betahat/sebetahat)
res  = susie_rss(zhat,R, n=n)
#> WARNING: XtX is not symmetric; forcing XtX to be symmetric by replacing XtX with (XtX + t(XtX))/2

# Toy example illustrating behaviour susie_rss when the z-scores
# are mostly consistent with a non-invertible correlation matrix.
# Here the CS should contain both variables, and two PIPs should
# be nearly the same.
z = c(6,6.01)
R = matrix(1,2,2)
fit = susie_rss(z,R)
#> WARNING: Providing the sample size (n), or even a rough estimate of n, is highly recommended. Without n, the implicit assumption is n is large (Inf) and the effect sizes are small (close to zero).
print(fit$sets$cs)
#> $L1
#> [1] 1 2
#> 
print(fit$pip)
#> [1] 0.4854079 0.5145921

# In this second toy example, the only difference is that one
# z-score is much larger than the other. Here we expect that the
# second PIP will be much larger than the first.
z = c(6,7)
R = matrix(1,2,2)
fit = susie_rss(z,R)
#> WARNING: Providing the sample size (n), or even a rough estimate of n, is highly recommended. Without n, the implicit assumption is n is large (Inf) and the effect sizes are small (close to zero).
print(fit$sets$cs)
#> $L1
#> [1] 2
#> 
print(fit$pip)
#> [1] 0.001713869 0.998286131