`R/susie_rss.R`

`susie_rss.Rd`

`susie_rss`

performs variable selection under a
sparse Bayesian multiple linear regression of \(Y\) on \(X\)
using the z-scores from standard univariate regression
of \(Y\) on each column of \(X\), an estimate, \(R\), of
the correlation matrix for the columns of \(X\), and optionally,
*but strongly recommended*, the sample size n. See
“Details” for other ways to call `susie_rss`

```
susie_rss(
z,
R,
n,
bhat,
shat,
var_y,
z_ld_weight = 0,
estimate_residual_variance = FALSE,
prior_variance = 50,
check_prior = TRUE,
...
)
```

- z
p-vector of z-scores.

- R
p x p correlation matrix.

- n
The sample size.

- bhat
Alternative summary data giving the estimated effects (a vector of length p). This, together with

`shat`

, may be provided instead of`z`

.- shat
Alternative summary data giving the standard errors of the estimated effects (a vector of length p). This, together with

`bhat`

, may be provided instead of`z`

.- var_y
The sample variance of y, defined as \(y'y/(n-1)\). When the sample variance is not provided, the coefficients (returned from

`coef`

) are computed on the “standardized” X, y scale.- z_ld_weight
This parameter is included for backwards compatibility with previous versions of the function, but it is no longer recommended to set this to a non-zero value. When

`z_ld_weight > 0`

, the matrix`R`

is adjusted to be`cov2cor((1-w)*R + w*tcrossprod(z))`

, where`w = z_ld_weight`

.- estimate_residual_variance
The default is FALSE, the residual variance is fixed to 1 or variance of y. If the in-sample LD matrix is provided, we recommend setting

`estimate_residual_variance = TRUE`

.- prior_variance
The prior variance(s) for the non-zero noncentrality parameterss \(\tilde{b}_l\). It is either a scalar, or a vector of length L. When the

`susie_suff_stat`

option`estimate_prior_variance`

is set to`TRUE`

(which is highly recommended) this simply provides an initial value for the prior variance. The default value of 50 is simply intended to be a large initial value. Note this setting is only relevant when`n`

is unknown. If`n`

is known, the relevant option is`scaled_prior_variance`

in`susie_suff_stat`

.- check_prior
When

`check_prior = TRUE`

, it checks if the estimated prior variance becomes unreasonably large (comparing with 100 * max(abs(z))^2).- ...
Other parameters to be passed to

`susie_suff_stat`

.

A `"susie"`

object with the following
elements:

- alpha
An L by p matrix of posterior inclusion probabilites.

- mu
An L by p matrix of posterior means, conditional on inclusion.

- mu2
An L by p matrix of posterior second moments, conditional on inclusion.

- lbf
log-Bayes Factor for each single effect.

- lbf_variable
log-Bayes Factor for each variable and single effect.

- V
Prior variance of the non-zero elements of b.

- elbo
The value of the variational lower bound, or “ELBO” (objective function to be maximized), achieved at each iteration of the IBSS fitting procedure.

- sets
Credible sets estimated from model fit; see

`susie_get_cs`

for details.- pip
A vector of length p giving the (marginal) posterior inclusion probabilities for all p covariates.

- niter
Number of IBSS iterations that were performed.

- converged
`TRUE`

or`FALSE`

indicating whether the IBSS converged to a solution within the chosen tolerance level.

In some applications, particularly genetic applications,
it is desired to fit a regression model (\(Y = Xb + E\) say,
which we refer to as "the original regression model" or ORM)
without access to the actual values of \(Y\) and \(X\), but
given only some summary statistics. `susie_rss`

assumes
availability of z-scores from standard univariate regression of
\(Y\) on each column of \(X\), and an estimate, \(R\), of the
correlation matrix for the columns of \(X\) (in genetic
applications \(R\) is sometimes called the “LD matrix”).

With the inputs `z`

, `R`

and sample size `n`

,
`susie_rss`

computes PVE-adjusted z-scores `z_tilde`

, and
calls `susie_suff_stat`

with `XtX = (n-1)R`

, ```
Xty =
```

\(\sqrt{n-1} z_tilde\), `yty = n-1`

, `n = n`

. The
output effect estimates are on the scale of \(b\) in the ORM with
*standardized* \(X\) and \(y\). When the LD matrix
`R`

and the z-scores `z`

are computed using the same
matrix \(X\), the results from `susie_rss`

are same as, or
very similar to, `susie`

with *standardized* \(X\) and
\(y\).

Alternatively, if the user provides `n`

, `bhat`

(the
univariate OLS estimates from regressing \(y\) on each column of
\(X\)), `shat`

(the standard errors from these OLS
regressions), the in-sample correlation matrix \(R =
cov2cor(crossprod(X))\), and the variance of \(y\), the results
from `susie_rss`

are same as `susie`

with \(X\) and
\(y\). The effect estimates are on the same scale as the
coefficients \(b\) in the ORM with \(X\) and \(y\).

In rare cases in which the sample size, \(n\), is unknown,
`susie_rss`

calls `susie_suff_stat`

with `XtX = R`

and `Xty = z`

, and with `residual_variance = 1`

. The
underlying assumption of performing the analysis in this way is
that the sample size is large (*i.e.*, infinity), and/or the
effects are small. More formally, this combines the log-likelihood
for the noncentrality parameters, \(\tilde{b} = \sqrt{n} b\),
$$L(\tilde{b}; z, R) = -(\tilde{b}'R\tilde{b} -
2z'\tilde{b})/2,$$ with the “susie prior” on
\(\tilde{b}\); see `susie`

and Wang *et al*
(2020) for details. In this case, the effect estimates returned by
`susie_rss`

are on the noncentrality parameter scale.

The `estimate_residual_variance`

setting is `FALSE`

by
default, which is recommended when the LD matrix is estimated from
a reference panel. When the LD matrix `R`

and the summary
statistics `z`

(or `bhat`

, `shat`

) are computed
using the same matrix \(X\), we recommend setting
`estimate_residual_variance = TRUE`

.

G. Wang, A. Sarkar, P. Carbonetto and M. Stephens (2020). A simple
new approach to variable selection in regression, with application
to genetic fine-mapping. *Journal of the Royal Statistical
Society, Series B* **82**, 1273-1300
https://doi.org/10.1101/501114.

Y. Zou, P. Carbonetto, G. Wang and M. Stephens (2021).
Fine-mapping from summary data with the “Sum of Single Effects”
model. *bioRxiv* https://doi.org/10.1101/2021.11.03.467167.

```
set.seed(1)
n = 1000
p = 1000
beta = rep(0,p)
beta[1:4] = 1
X = matrix(rnorm(n*p),nrow = n,ncol = p)
X = scale(X,center = TRUE,scale = TRUE)
y = drop(X %*% beta + rnorm(n))
input_ss = compute_suff_stat(X,y,standardize = TRUE)
ss = univariate_regression(X,y)
R = with(input_ss,cov2cor(XtX))
zhat = with(ss,betahat/sebetahat)
res = susie_rss(zhat,R, n=n)
#> WARNING: XtX is not symmetric; forcing XtX to be symmetric by replacing XtX with (XtX + t(XtX))/2
# Toy example illustrating behaviour susie_rss when the z-scores
# are mostly consistent with a non-invertible correlation matrix.
# Here the CS should contain both variables, and two PIPs should
# be nearly the same.
z = c(6,6.01)
R = matrix(1,2,2)
fit = susie_rss(z,R)
#> WARNING: Providing the sample size (n), or even a rough estimate of n, is highly recommended. Without n, the implicit assumption is n is large (Inf) and the effect sizes are small (close to zero).
print(fit$sets$cs)
#> $L1
#> [1] 1 2
#>
print(fit$pip)
#> [1] 0.4854079 0.5145921
# In this second toy example, the only difference is that one
# z-score is much larger than the other. Here we expect that the
# second PIP will be much larger than the first.
z = c(6,7)
R = matrix(1,2,2)
fit = susie_rss(z,R)
#> WARNING: Providing the sample size (n), or even a rough estimate of n, is highly recommended. Without n, the implicit assumption is n is large (Inf) and the effect sizes are small (close to zero).
print(fit$sets$cs)
#> $L1
#> [1] 2
#>
print(fit$pip)
#> [1] 0.001713869 0.998286131
```