`vignettes/intro_correlations.Rmd`

`intro_correlations.Rmd`

In some settings measurements and tests in different conditions may be correlated with one another. For example, in eQTL applications this can occur due to sample overlap among the different conditions.

Failure to deal with such correlations can cause false positives in a
`mashr`

analysis.

To deal with these correlations `mashr`

allows the user to
specify a correlation matrix \(V\) when
setting up the data in `mash_set_data`

. We introduce two
methods to estimate this correlation matrix. The first method is simple
and fast. It estimates the correlation matrix using
`estimate_null_correlation_simple`

, which, as its name
suggests, uses the null tests (specifically, tests without a strong
\(z\) score) to estimate the
correlations. The second method may provide a better `mash`

fit. It estimates the correlations using
`mash_estimate_corr_em`

, which uses an ad hoc EM
algorithm.

The method is described in Urbut et al.

Here we simulate data with correlations.

`# Loading required package: ashr`

```
set.seed(1)
simdata = simple_sims(500,5,1)
V = matrix(0.5,5,5)
diag(V) = 1
simdata$Bhat = simdata$B + mvtnorm::rmvnorm(2000, sigma = V)
```

Read in the data, and estimate correlations:

```
data = mash_set_data(simdata$Bhat, simdata$Shat)
V.simple = estimate_null_correlation_simple(data)
data.Vsimple = mash_update_data(data, V=V.simple)
```

Now we have two mash data objects, one (`data.Vsimple`

)
with correlations specified, and one without (`data`

). So
analyses using `data.Vsimple`

will allow for correlations,
whereas analyses using `data`

will assume measurements are
independent.

Here, for illustration purposes, we proceed to analyze the data with correlations, using just the simple canonical covariances as in the initial introductory vignette.

```
U.c = cov_canonical(data.Vsimple)
m.Vsimple = mash(data.Vsimple, U.c) # fits with correlations because data.V includes correlation information
```

```
# - Computing 2000 x 151 likelihood matrix.
# - Likelihood calculations took 0.04 seconds.
# - Fitting model with 151 mixture components.
# - Model fitting took 0.57 seconds.
# - Computing posterior matrices.
# - Computation allocated took 0.01 seconds.
```

`print(get_loglik(m.Vsimple),digits=10) # log-likelihood of the fit with correlations set to V`

`# [1] -14689.87712`

We can also compare with the original analysis. (Note that the
canonical covariances do not depend on the correlations, so we can use
the same `U.c`

here for both analyses. If we used data-driven
covariances we might prefer to estimate these separately for each
analysis as the correlations would affect them.)

`m.orig = mash(data, U.c) # fits without correlations because data object was set up without correlations`

```
# - Computing 2000 x 151 likelihood matrix.
# - Likelihood calculations took 0.04 seconds.
# - Fitting model with 151 mixture components.
# - Model fitting took 0.63 seconds.
# - Computing posterior matrices.
# - Computation allocated took 0.01 seconds.
```

`print(get_loglik(m.orig),digits=10)`

`# [1] -14904.79133`

```
loglik = c(get_loglik(m.orig), get_loglik(m.Vsimple))
significant = c(length(get_significant_results(m.orig)), length(get_significant_results(m.Vsimple)))
false_positive = c(sum(get_significant_results(m.orig) < 501),
sum(get_significant_results(m.Vsimple) < 501))
tb = rbind(loglik, significant, false_positive)
colnames(tb) = c('without cor', 'V simple')
row.names(tb) = c('log likelihood', '# significance', '# False positive')
tb
```

```
# without cor V simple
# log likelihood -14904.79 -14689.88
# # significance 410.00 89.00
# # False positive 62.00 2.00
```

The log-likelihood with correlations is higher than without correlations. The false positives reduce.

The method is described in Yuxin Zou’s thesis.

To estimate the residual correlations using EM method, it requires covariance matrices for the signals. We proceed with the simple canonical covariances.

With `details = TRUE`

in
`mash_estimate_corr_em`

, it returns the estimates residual
correlation matrix with the mash fit.

```
V.em = mash_estimate_corr_em(data, U.c, details = TRUE)
m.Vem = V.em$mash.model
print(get_loglik(m.Vem),digits=10) # log-likelihood of the fit
```

`# [1] -14654.32361`

```
loglik = c(get_loglik(m.orig), get_loglik(m.Vsimple), get_loglik(m.Vem))
significant = c(length(get_significant_results(m.orig)), length(get_significant_results(m.Vsimple)),
length(get_significant_results(m.Vem)))
false_positive = c(sum(get_significant_results(m.orig) < 501),
sum(get_significant_results(m.Vsimple) < 501),
sum(get_significant_results(m.Vem) < 501))
tb = rbind(loglik, significant, false_positive)
colnames(tb) = c('without cor', 'V simple', 'V EM')
row.names(tb) = c('log likelihood', '# significance', '# False positive')
tb
```

```
# without cor V simple V EM
# log likelihood -14904.79 -14689.88 -14654.32
# # significance 410.00 89.00 95.00
# # False positive 62.00 2.00 0.00
```

Comparing with Method 1, the log likelihood from Method 2 is higher.

The EM updates in `mash_estimate_corr_em`

needs some time
to converge. There are several things we can do to reduce the running
time. First of all, we can set the number of iterations to a small
number. Because there is a large improvement in the log-likelihood
within the first few iterations, running the algorithm with small number
of iterations provides estimates of correlation matrix that is better
than the initial value. Moreover, we can estimate the correlation matrix
using a random subset of genes, not the whole observed genes.