vignettes/intro_mashcommonbaseline.Rmd
intro_mashcommonbaseline.Rmd
This vignette illustrates how to use mashr
to estimate
the change in some quantity measured in multiple conditions compared
with a common control condition.
We assume that we have measurements in multiple conditions, and want to estimate the deviation in each condition from the control: that is, the difference in mean between that condition and the control condition. When we compare every condition to the same control then the observed deviations are correlated with one another (even under the null where there are no true differences among conditions). These correlations, if not properly accounted for, can lead to many false positives in a multivariate analysis. This vignette illustrates how to properly account for such correlations.
Here is the write-up for the details of the/ model. When there is no control condition in the study, we can compare the quantity in different conditions with the mean. We illustrate an example in the common baseline at the mean vignette.
To deal with these correlations, mashr allows the user to specify the
reference condition using mash_update_data
, after setting
up the data in mash_set_data
.
Note: The correlations in deviations induced by
comparing to a common baseline/control occur even if the measurements in
different conditions are entirely independent. If the measurements in
different conditions are also correlated with one another (eg in eQTL
applications this can occur due to sample overlap among the different
conditions) then this induces additional correlations into the analysis
that should also be taken into account. In common baseline
analysis, such additional correlations can be specified by the user (we
have not yet implemented methods to estimate this additional correlation
from the data).
Here we simulate data for illustration. This simulation routine creates a dataset with 8 conditions and 12000 samples, the last condition is the control condition. 90% of the samples have no deviations from the control condition. The remaining 10% of the samples are “non-null”, and consist of equal numbers of three different types of deviations: equal among conditions \(1, \cdots, 7\), present only in condition 1, independent across conditions \(1, \cdots, 7\).
Our goal is to estimate the deviations in condition \(1, \cdots, 7\) compared with the control condition.
# Loading required package: ashr
set.seed(1)
simdata = sim_contrast2(nsamp = 12000, ncond = 8)
We demonstrate the right way and the wrong to do the analysis
Read in the data, and set the control condition
data = mash_set_data(simdata$Chat, simdata$Shat)
data.L = mash_update_data(data, ref = 8)
The updated mash data object (data.L
) includes the
induced correlation internally.
We proceed the analysis using just the simple canonical covariances as in the initial introductory vignette.
U.c = cov_canonical(data.L)
mashcontrast.model = mash(data.L, U.c, algorithm.version = 'R')
# - Computing 12000 x 181 likelihood matrix.
# - Likelihood calculations took 1.02 seconds.
# - Fitting model with 181 mixture components.
# - Model fitting took 5.58 seconds.
# - Computing posterior matrices.
# - Computation allocated took 0.17 seconds.
print(get_loglik(mashcontrast.model),digits=10)
# [1] -105525.1372
Use get_significant_results
to find the indices of
effects that are ‘significant’:
length(get_significant_results(mashcontrast.model))
# [1] 58
The number of false positive is 1.
We fit the mash model ignoring the induced correlation.
L = contrast_matrix(8, ref=8)
data.wrong = mash_set_data(Bhat = simdata$Chat %*% t(L), Shat = 1)
m = mash(data.wrong, U.c)
# - Computing 12000 x 181 likelihood matrix.
# - Likelihood calculations took 0.30 seconds.
# - Fitting model with 181 mixture components.
# - Model fitting took 5.74 seconds.
# - Computing posterior matrices.
# - Computation allocated took 0.06 seconds.
print(get_loglik(m),digits = 10)
# [1] -111355.197
We can see that the log likelihood is lower, since it does not consider the induced correlation.
There are 3358 significant effects, 2932 of them are false positives. The number of false positives is much more than the one include the induced correlation.