Model fitting algorithms for Multiple Regression with Adaptive Shrinkage ("Mr.ASH"). Mr.ASH is a variational empirical Bayes (VEB) method for multiple linear regression. The fitting algorithms (locally) maximize the approximate marginal likelihood (the "evidence lower bound", or ELBO) via coordinate-wise updates.
Usage
mr.ash(
X,
y,
Z = NULL,
sa2 = NULL,
method_q = c("sigma_dep_q", "sigma_indep_q"),
method_g = c("caisa", "accelerate", "block"),
max.iter = 1000,
min.iter = 1,
beta.init = NULL,
update.pi = TRUE,
pi = NULL,
update.sigma2 = TRUE,
sigma2 = NULL,
update.order = NULL,
standardize = FALSE,
intercept = TRUE,
tol = set_default_tolerance(),
verbose = TRUE
)
Arguments
- X
The input matrix, of dimension (n,p); each column is a single predictor, and each row is an observation vector. Here, n is the number of samples and p is the number of predictors. The matrix cannot be sparse.
- y
The observed continuously-valued responses, a vector of length n.
- Z
The covariate matrix, of dimension (n,k), where k is the number of covariates. This feature is not yet implemented; Z must be set to NULL.
- sa2
The vector of prior mixture component variances. The variances should be in increasing order, starting at zero; that is, sort(sa2) should be the same as sa2. When sa2 is NULL, the default setting is used, sa2[k] = (2^(0.05*(k-1)) - 1)^2, for k = 1:20. For this default setting, sa2[1] = 0, and sa2[20] is roughly 1; this default grid is written out in the sketch following the argument list.
- method_q
The algorithm used to update the variational approximation to the posterior distribution of the regression coefficients. The two settings, method_q = "sigma_dep_q" and method_q = "sigma_indep_q", take different approaches to updating the residual variance \(\sigma^2\).
- method_g
method = "caisa", an abbreviation of "Cooridinate Ascent Iterative Shinkage Algorithm", fits the model by approximate EM; it iteratively updates the variational approximation to the posterior distribution of the regression coefficients (the approximate E-step) and the model parameters (mixture weights and residual covariance) in an approximate M-step. Settingsmethod = "block"andmethod = "accelerate"are considered experimental.- max.iter
The maximum number of outer loop iterations allowed.
- min.iter
The minimum number of outer loop iterations allowed.
- beta.init
The initial estimate of the (approximate) posterior mean regression coefficients. This should be NULL, or a vector of length p. When beta.init is NULL, the posterior mean coefficients are all initially set to zero.
- update.pi
If update.pi = TRUE, the mixture proportions in the mixture-of-normals prior are estimated from the data. In the manuscript, update.pi = TRUE.
- pi
The initial estimate of the mixture proportions \(\pi_1, \ldots, \pi_K\). If pi is NULL, the mixture weights are initialized to rep(1/K,K), where K = length(sa2).
- update.sigma2
If update.sigma2 = TRUE, the residual variance \(\sigma^2\) is estimated from the data. In the manuscript, update.sigma2 = TRUE.
- sigma2
The initial estimate of the residual variance, \(\sigma^2\). If sigma2 = NULL, the residual variance is initialized to the empirical variance of the residuals based on the initial estimates of the regression coefficients, beta.init, after removing the linear effects of the intercept and any covariates.
- update.order
The order in which the coordinate ascent updates for estimating the posterior mean coefficients are performed. update.order can be NULL, "random", or any permutation of \((1,\ldots,p)\), where p is the number of columns in the input matrix X. When update.order is NULL, the coordinate ascent updates are performed in the order in which the predictors appear in X; this is equivalent to setting update.order = 1:p. When update.order = "random", the coordinate ascent updates are performed in a randomly generated order that is re-drawn at each outer-loop iteration.
- standardize
A logical flag indicating whether the columns of X should be standardized prior to model fitting. The coefficients are always returned on the original scale.
- intercept
When intercept = TRUE, an intercept is included in the regression model.
- tol
Additional settings controlling the behaviour of the model fitting algorithm. tol$convtol controls the termination criterion: the outer-loop updates stop when the relative L2 change in the estimates of the posterior mean coefficients is less than convtol, i.e., ||beta_new - beta_old||_2 / max(1, ||beta_old||_2) < convtol (see the sketch after this list). tol$epstol is a small, positive number added to the likelihoods to avoid taking logarithms of zero.
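As a quick illustration of the defaults described above, the sa2 grid and the stopping rule can be written out directly in R. This is a sketch of the quantities as documented, not code from the package:
## Default prior variance grid: sa2[1] = 0, sa2[20] is roughly 1.
k   = 1:20
sa2 = (2^(0.05*(k-1)) - 1)^2
## Relative L2 change compared against tol$convtol at each outer iteration:
rel.change = function (beta.new, beta.old)
  sqrt(sum((beta.new - beta.old)^2)) / max(1, sqrt(sum(beta.old^2)))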
Value
A list object with the following elements:
- intercept
The estimated intercept.
- beta
A vector containing posterior mean estimates of the regression coefficients for all predictors.
- sigma2
The estimated residual variance.
- pi
A vector containing the estimated mixture proportions.
- iter
The number of outer-loop iterations that were performed.
- update.order
The ordering used for performing the coordinate-wise updates. For update.order = "random", the orderings for the outer-loop iterations are provided in a vector of length p*max.iter, where p is the number of predictors.
- varobj
A vector of length numiter, containing the value of the variational objective (equal to the negative "evidence lower bound") attained at each (outer-loop) model fitting iteration. Note that the objective does not account for the intercept term, even when intercept = TRUE; therefore, this value should be interpreted as an approximation to the marginal likelihood conditional on the estimate of the intercept.
- data
The preprocessed data (X, Z, y) provided as input to the model fitting algorithm.
data$w is equal to diag(crossprod(X)), in which X is the preprocessed data matrix. Additionally, data$sa2 gives the prior variances used.
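The elements of the returned list can be inspected directly. A brief illustration, assuming a fitted object fit.mr.ash as created in the 'Examples' below:
fit.mr.ash$intercept          # estimated intercept
head(fit.mr.ash$beta)         # posterior mean coefficients
fit.mr.ash$sigma2             # estimated residual variance
fit.mr.ash$pi                 # estimated mixture proportions
tail(fit.mr.ash$varobj, 1)    # final value of the variational objective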
Details
Mr.ASH is a statistical inference method for the following multiple linear regression model: $$y | X, \beta, \sigma^2 ~ N(X \beta, \sigma^2 I_n),$$ in which the regression coefficients \(\beta\) admit a mixture-of-normals prior, $$\beta | \pi, \sigma ~ g = \sum_{k=1}^K \pi_k N(0, \sigma^2 \sigma_k^2).$$ Each mixture component in the prior, \(g\), is a normal density centered at zero, with variance \(\sigma^2 \sigma_k^2\).
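As an illustration, data can be simulated from this model in a few lines of R. The choices of n, p, sigma2 and the uniform mixture weights below are arbitrary for the sake of the example; sa2 follows the default grid documented above:
## Illustrative simulation from the Mr.ASH model (settings chosen arbitrarily).
set.seed(1)
n = 100; p = 50; sigma2 = 1
sa2 = (2^(0.05*(0:19)) - 1)^2                   # default grid of prior variances
pi  = rep(1/20,20)                              # uniform mixture weights
k   = sample(20,p,replace = TRUE,prob = pi)     # mixture component per coefficient
beta = rnorm(p,0,sqrt(sigma2 * sa2[k]))         # beta_j ~ N(0, sigma^2 * sa2[k])
X = matrix(rnorm(n*p),n,p)
y = drop(X %*% beta + rnorm(n,0,sqrt(sigma2)))  # y ~ N(X beta, sigma^2 I_n)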
The fitting algorithm, if run for a large enough number of iterations, will find an approximate posterior for the regression coefficients, denoted by \(q(\beta)\), a residual variance parameter \(\sigma^2\), and prior mixture weights \(\pi_1, \ldots, \pi_K\) maximizing the evidence lower bound, $$F(q, \pi, \sigma^2) = E_q \log p(y | X, \beta, \sigma^2) - \sum_{j=1}^p D_{KL}(q_j || g),$$ where \(D_{KL}(q_j || g)\) denotes the Kullback-Leibler (KL) divergence, a measure of the "distance" between the (approximate) posterior \(q_j(\beta_j)\) and the prior \(g(\beta_j)\). The fitting algorithm iteratively updates the approximate posteriors \(q_1, \ldots, q_p\), separately for each \(j = 1, \ldots, p\) (in an order determined by update.order), then separately updates the mixture weights \(\pi\) and the residual variance \(\sigma^2\). This coordinate-wise update scheme iterates until the convergence criterion is met, or until the algorithm hits the upper bound on the number of iterations (specified by max.iter). The coordinate-wise optimization algorithms are implemented in C++ for efficient handling of large-scale data.
See ‘References’ for more details about the model and algorithm.
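In outline, the scheme just described corresponds to the following R pseudocode. This is a schematic sketch only; update_qj, update_pi and update_sigma2 are hypothetical stand-ins for the package's internal C++ routines, not exported functions:
## Schematic pseudocode of the outer loop (helper functions are hypothetical).
for (iter in 1:max.iter) {
  beta.old = beta
  for (j in update.order)                # approximate E-step, coordinate-wise
    beta[j] = update_qj(X,y,beta,j,pi,sigma2,sa2)
  pi     = update_pi(X,y,beta,sa2)       # approximate M-step: mixture weights
  sigma2 = update_sigma2(X,y,beta)       # approximate M-step: residual variance
  d = sqrt(sum((beta - beta.old)^2)) / max(1, sqrt(sum(beta.old^2)))
  if (d < tol$convtol && iter >= min.iter)
    break                                # convergence criterion met
}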
References
Y. Kim (2020), Bayesian shrinkage methods for high dimensional regression. Ph.D. thesis, University of Chicago.
Examples
### generate synthetic data
set.seed(1)
n = 200
p = 300
X = matrix(rnorm(n*p),n,p)
beta = double(p)
beta[1:10] = 1:10
y = X %*% beta + rnorm(n)
### fit Mr.ASH
fit.mr.ash = mr.ash(X,y, method_q = "sigma_indep_q")
#> Mr.ASH terminated at iteration 24: max|beta|=1.0032e+01, sigma2=3.0606e+00, pi0=0.7758
fit.mr.ash = mr.ash(X,y, method_q = "sigma_scaled_beta")
#> Error in match.arg(method_q): 'arg' should be one of “sigma_dep_q”, “sigma_indep_q”
fit.mr.ash = mr.ash(X,y, method_q = "sigma_dep_q")
#> Mr.ASH terminated at iteration 25: max|beta|=1.0032e+01, sigma2=3.0601e+00, pi0=0.7703
### prediction routine
Xnew = matrix(rnorm(n*p),n,p)
ynew = Xnew %*% beta + rnorm(n)
ypred = predict(fit.mr.ash, Xnew)
### test error
rmse = norm(ynew - ypred, '2') / sqrt(n)
### coefficients
betahat = predict(fit.mr.ash, type = "coefficients")
# this equals c(fit.mr.ash$intercept, fit.mr.ash$beta)
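### illustrative variations using documented arguments (sketches; output not shown)
### random coordinate update order, re-drawn at each outer-loop iteration
fit.rand = mr.ash(X,y, update.order = "random")
### fix the residual variance at a given value rather than estimating it
fit.fixed = mr.ash(X,y, update.sigma2 = FALSE, sigma2 = 1)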