Skip to contents

Performs a sparse Bayesian multiple linear regression of y on X, using the "Sum of Single Effects" model from Wang et al (2020). In brief, this function fits the regression model \(y = \mu + X b + e\), where elements of \(e\) are i.i.d. normal with zero mean and variance residual_variance, \(\mu\) is an intercept term and \(b\) is a vector of length p representing the effects to be estimated. The “susie assumption” is that \(b = \sum_{l=1}^L b_l\) where each \(b_l\) is a vector of length p with exactly one non-zero element. The prior on the non-zero element is normal with zero mean and variance var(y) * scaled_prior_variance. The value of L is fixed, and should be chosen to provide a reasonable upper bound on the number of non-zero effects to be detected. Typically, the hyperparameters residual_variance and scaled_prior_variance will be estimated during model fitting, although they can also be fixed as specified by the user. See functions susie_get_cs and other functions of form susie_get_* to extract the most commonly-used results from a susie fit.

#' @details The function susie implements the IBSS algorithm from Wang et al (2020). The option refine = TRUE implements an additional step to help reduce problems caused by convergence of the IBSS algorithm to poor local optima (which is rare in our experience, but can provide misleading results when it occurs). The refinement step incurs additional computational expense that increases with the number of CSs found in the initial run.

The function susie_ss implements essentially the same algorithms, but using sufficient statistics. (The statistics are sufficient for the regression coefficients \(b\), but not for the intercept \(\mu\); see below for how the intercept is treated.) If the sufficient statistics are computed correctly then the results from susie_ss should be the same as (or very similar to) susie, although runtimes will differ as discussed below. The sufficient statistics are the sample size n, and then the p by p matrix \(X'X\), the p-vector \(X'y\), and the sum of squared y values \(y'y\), all computed after centering the columns of \(X\) and the vector \(y\) to have mean 0; these can be computed using compute_suff_stat.

The handling of the intercept term in susie_ss needs some additional explanation. Computing the summary data after centering X and y effectively ensures that the resulting posterior quantities for \(b\) allow for an intercept in the model; however, the actual value of the intercept cannot be estimated from these centered data. To estimate the intercept term the user must also provide the column means of \(X\) and the mean of \(y\) (X_colmeans and y_mean). If these are not provided, they are treated as NA, which results in the intercept being NA. If for some reason you prefer to have the intercept be 0 instead of NA then set X_colmeans = 0,y_mean = 0.

For completeness, we note that if susie_ss is run on \(X'X, X'y, y'y\) computed without centering \(X\) and \(y\), and with X_colmeans = 0,y_mean = 0, this is equivalent to susie applied to \(X, y\) with intercept = FALSE (although results may differ due to different initializations of residual_variance and scaled_prior_variance). However, this usage is not recommended for for most situations.

The computational complexity of susie is \(O(npL)\) per iteration, whereas susie_ss is \(O(p^2L)\) per iteration (not including the cost of computing the sufficient statistics, which is dominated by the \(O(np^2)\) cost of computing \(X'X\)). Because of the cost of computing \(X'X\), susie will usually be faster. However, if \(n >> p\), and/or if \(X'X\) is already computed, then susie_ss may be faster.

Usage

susie(
  X,
  y,
  L = min(10, ncol(X)),
  scaled_prior_variance = 0.2,
  residual_variance = NULL,
  prior_weights = NULL,
  null_weight = 0,
  standardize = TRUE,
  intercept = TRUE,
  estimate_residual_variance = TRUE,
  estimate_residual_method = c("MoM", "MLE", "Servin_Stephens"),
  estimate_prior_variance = TRUE,
  estimate_prior_method = c("optim", "EM", "simple"),
  unmappable_effects = c("none", "inf", "ash"),
  check_null_threshold = 0,
  prior_tol = 1e-09,
  residual_variance_upperbound = Inf,
  model_init = NULL,
  coverage = 0.95,
  min_abs_corr = 0.5,
  compute_univariate_zscore = FALSE,
  na.rm = FALSE,
  max_iter = 100,
  tol = 0.001,
  convergence_method = c("elbo", "pip"),
  verbose = FALSE,
  track_fit = FALSE,
  residual_variance_lowerbound = var(drop(y))/10000,
  refine = FALSE,
  n_purity = 100,
  alpha0 = 0,
  beta0 = 0
)

Arguments

X

An n by p matrix of covariates.

y

The observed responses, a vector of length n.

L

Maximum number of non-zero effects in the model. If L is larger than the number of covariates, p, L is set to p.

scaled_prior_variance

The prior variance, divided by var(y) (or by (1/(n-1))yty for susie_ss); that is, the prior variance of each non-zero element of b is var(y) * scaled_prior_variance. The value provided should be either a scalar or a vector of length L. If estimate_prior_variance = TRUE, this provides initial estimates of the prior variances.

residual_variance

Variance of the residual. If estimate_residual_variance = TRUE, this value provides the initial estimate of the residual variance. By default, it is set to var(y) in susie and (1/(n-1))yty in susie_ss.

prior_weights

A vector of length p, in which each entry gives the prior probability that corresponding column of X has a nonzero effect on the outcome, y.

null_weight

Prior probability of no effect (a number between 0 and 1, and cannot be exactly 1).

standardize

If standardize = TRUE, standardize the columns of X to unit variance prior to fitting (or equivalently standardize XtX and Xty to have the same effect). Note that scaled_prior_variance specifies the prior on the coefficients of X after standardization (if it is performed). If you do not standardize, you may need to think more carefully about specifying scaled_prior_variance. Whatever your choice, the coefficients returned by coef are given for X on the original input scale. Any column of X that has zero variance is not standardized.

intercept

If intercept = TRUE, the intercept is fitted; it intercept = FALSE, the intercept is set to zero. Setting intercept = FALSE is generally not recommended.

estimate_residual_variance

If estimate_residual_variance = TRUE, the residual variance is estimated, using residual_variance as an initial value. If estimate_residual_variance = FALSE, the residual variance is fixed to the value supplied by residual_variance.

estimate_residual_method

The method used for estimating residual variance. For the original SuSiE model, "MLE" and "MoM" estimation is equivalent, but for the infinitesimal model, "MoM" is more stable. We recommend using "Servin_Stephens" when n < 80 for improved coverage, although it is currently only implemented for individual-level data.

estimate_prior_variance

If estimate_prior_variance = TRUE, the prior variance is estimated (this is a separate parameter for each of the L effects). If provided, scaled_prior_variance is then used as an initial value for the optimization. When estimate_prior_variance = FALSE, the prior variance for each of the L effects is determined by the value supplied to scaled_prior_variance.

estimate_prior_method

The method used for estimating prior variance. When estimate_prior_method = "simple" is used, the likelihood at the specified prior variance is compared to the likelihood at a variance of zero, and the setting with the larger likelihood is retained.

unmappable_effects

The method for modeling unmappable effects: "none", "inf", "ash".

check_null_threshold

When the prior variance is estimated, compare the estimate with the null, and set the prior variance to zero unless the log-likelihood using the estimate is larger by this threshold amount. For example, if you set check_null_threshold = 0.1, this will "nudge" the estimate towards zero when the difference in log-likelihoods is small. A note of caution that setting this to a value greater than zero may lead the IBSS fitting procedure to occasionally decrease the ELBO. This setting is disabled when using unmappable_effects = "inf" or unmappable_effects = "ash".

prior_tol

When the prior variance is estimated, compare the estimated value to prior_tol at the end of the computation, and exclude a single effect from PIP computation if the estimated prior variance is smaller than this tolerance value.

residual_variance_upperbound

Upper limit on the estimated residual variance. It is only relevant when estimate_residual_variance = TRUE.

model_init

A previous susie fit with which to initialize.

coverage

A number between 0 and 1 specifying the “coverage” of the estimated confidence sets.

min_abs_corr

Minimum absolute correlation allowed in a credible set. The default, 0.5, corresponds to a squared correlation of 0.25, which is a commonly used threshold for genotype data in genetic studies.

compute_univariate_zscore

If compute_univariate_zscore = TRUE, the univariate regression z-scores are outputted for each variable.

na.rm

Drop any missing values in y from both X and y.

max_iter

Maximum number of IBSS iterations to perform.

tol

tol A small, non-negative number specifying the convergence tolerance for the IBSS fitting procedure.

convergence_method

When converge_method = "elbo" the fitting procedure halts when the difference in the variational lower bound, or “ELBO” (the objective function to be maximized), is less than tol. When converge_method = "pip" the fitting procedure halts when the maximum absolute difference in alpha is less than tol.

verbose

If verbose = TRUE, the algorithm's progress, a summary of the optimization settings, and refinement progress (if refine = TRUE) are printed to the console.

track_fit

If track_fit = TRUE, trace is also returned containing detailed information about the estimates at each iteration of the IBSS fitting procedure.

residual_variance_lowerbound

Lower limit on the estimated residual variance. It is only relevant when estimate_residual_variance = TRUE.

refine

If refine = TRUE, then an additional iterative refinement procedure is used, after the IBSS algorithm, to check and escape from local optima (see details).

n_purity

Passed as argument n_purity to susie_get_cs.

alpha0

Numerical parameter for the NIG prior when using estimate_residual_method = "Servin_Stephens".

beta0

Numerical parameter for the NIG prior when using estimate_residual_method = "Servin_Stephens".

Value

A "susie" object with some or all of the following elements:

alpha

An L by p matrix of posterior inclusion probabilities.

mu

An L by p matrix of posterior means, conditional on inclusion.

mu2

An L by p matrix of posterior second moments, conditional on inclusion.

Xr

A vector of length n, equal to X %*% colSums(alpha * mu).

lbf

Log-Bayes Factor for each single effect.

lbf_variable

Log-Bayes Factor for each variable and single effect.

intercept

Intercept (fixed or estimated).

sigma2

Residual variance (fixed or estimated).

V

Prior variance of the non-zero elements of b.

elbo

The variational lower bound (or ELBO) achieved at each iteration.

fitted

Vector of length n containing the fitted values.

sets

Credible sets estimated from model fit.

pip

A vector of length p giving the marginal posterior inclusion probabilities.

z

A vector of univariate z-scores.

niter

Number of IBSS iterations performed.

converged

TRUE or FALSE indicating whether the IBSS converged to a solution within the chosen tolerance level.

theta

If unmappable_effects = "inf" or unmappable_effects = "ash", then theta is a p-vector of posterior means for the unmappable effects.

tau2

If unmappable_effects = "inf" or unmappable_effects = "ash", then tau2 is the unmappable variance.