Introduction to mash: data-driven covariances

Goal

This vignette introduces the important topic of using data-driven covariance matries in mashr. You should read the introductory vignette before this.

Outline

Recall the four major steps in a mash analysis:

Read in the data
Set up the covariance matrices to be used
Fit the model
Extract posterior summaries and other quantities of interest

Here we pay more attention to Step 2, which can be split into two steps:

Set up the “canonical” covariance matrices.
Set up the “data-driven” covariance matrices.

The first of these steps (“canonical” covariance matrices) is straightforward using cov_canonical, as illustrated in the introductory vignette.

Setting up the data-driven matrices is slightly more complex, and there are multiple possible approaches. This vignette illustrates one approach.

First we simulate some data for illustration (see the the introductory vignette for more details):

library(ashr)
library(mashr)
set.seed(1)
simdata = simple_sims(500,5,1)

Data-driven covariances

Although the user is free to set up data driven covariance matrices using any method that they would like, typically we suggest the following three-step strategy:

Locate some strong signals. For example, here we do this within R by running a “condition-by-condition” using mash_1by1, but you could do this as a preprocessing step in other software - for example in eQTL studies you might instead put together a separate dataset containing the test results for only the “top” eQTL or eQTLs in each gene.
Use methods such as PCA or factor analysis on these top signals to compute some initial data-driven covariance matrices (e.g. using cov_pca).
Use these data-driven covariance matrices as initializations for the extreme deconvolution (ED) algorithm (using cov_ed), to get some refined data-driven covariance matrix estimates.

The ED step is helpful primarily because the second step will often estimate the covariances of the observed data (Bhat) whereas what is required is the covariances of the actual underlying effects (B), and this is what ED estimates. However, ED can be quite sensitive to initialization, and so the goal of the second step is to provide a good initialization.

Step 1: select strong signals

If your entire data set (matrix of all tests in all conditions) is not too big (eg <100k tests say) then you can simply do this by setting up the entire data set as a mash data object (using mash_set_data), and then running a condition-by-condition (1by1) analysis on all the data. For example, here we select the strong signals as those with lfsr<0.05 in any condition in the 1by1 analysis.

data   = mash_set_data(simdata$Bhat, simdata$Shat)
m.1by1 = mash_1by1(data)
strong = get_significant_results(m.1by1,0.05)

This sets up a vector strong containing the indices corresponding to the “significant” rows of the test results in data. This vector is used in subsequent functions below to specify which rows of data to use.

Step 2: Obtain initial data-driven covariance matrices

There are various approaches to this; here we just illustrate the simplest and quickest (but probably not the best), which is to use PCA. Specifically here we use the function cov_pca to produce covariance matrices based on the top 5 PCs of the strong signals. The result is a list of 6 covariance matrices: one based on all 5 PCs, and the others each based on one PC.

U.pca = cov_pca(data,5,subset=strong)
print(names(U.pca))

Step 3: Apply Extreme Deconvolution

The function cov_ed is used to apply the ED algorithm from a specified initialization (here U.pca) and to a specified subset of signals.

U.ed = cov_ed(data, U.pca, subset=strong)

Run mash

After all this we are ready to run mash using the data driven matrices. Remember the Crucial Rule that we must fit mash to all tests - do not use only the strong subset here!!

m.ed = mash(data, U.ed)
print(get_loglik(m.ed),digits = 10)

From the fit we see that the fit using the data-driven covariances is not as good as when we used the canonical covariances (which was -16120.32; from the introductory vignette). This is expected for this simulation because the simulation actually used the canonical covariances!

In general we recommend running mash with both data-driven and canonical covariances. You could do this by combining the data-driven and canonical covariances as in this code:

U.c = cov_canonical(data)  
m   = mash(data, c(U.c,U.ed))
print(get_loglik(m),digits = 10)

For an example with simulations that do not follow the standard canonical matrices see here.

Session information.

print(sessionInfo())

Matthew Stephens

2023-08-16