Investigating MASH in sparse signal setting¶

Analysis of sex-interacting QTL data. See this ticket for a motivation.

Motivation¶

We've been working on mash anlaysis of cis-eQTL where there are lots of signals. Recently it has been brought to our attention from the community that mash tends to produce a lot more signals than we can believe, eg in transQTL studies. The rough picture is that univariate analysis does not reveal many strong associations (<50 with fdr 0.05 from ~50 conditions combined), but after mash analysis we report >1000.

What we've checked and speculate¶

We have checked for model calibration using permutation. Nothing suspicious popped up. It is possible that when there is indeed a lot of sharing thus boosting the signal a lot, but I'm still suspicious. Certainly the group who analyzed transQTL using mash found way too many. But they report to me that most weights are on data driven matrices.

One possible problem is that the data-driven matrices are learned from the <50 signals. We should either learn from, say, top 1000, or just do not use data-driven matrices. Since I (probably) do not have the data I'm not going to check this.

Another potential issue is that the input data is not well calibrated, in terms of p-value. But I believe the colleagues who raised the problem should have noticed abnormality of p-values before reporting to us (still need to double-check with them).

Plans¶

For our ~50 conditions problem (GTEx V8) the sex specific QTL analysis,

Use FLASH to access pattern of sharing.
Run MASH without data driven covariance, and see if weights are still on the shared canonical component. Make a plot similar to here.
Even simpler than above, run mash just on the "equal effects" covariance, nothing else. This will effectively do a fixed effects analysis combining across 44 tissues, and give some baseline against which other results can be compared.
Splitting the tissues into 2 sets of ~25. Run mash as in 2 above. See how often the effects that are deemed significant by mash have the same sign of average effect in the other set of ~25 conditions. If mash is doing something sensible then it should be much more often than the 50% expected by chance in a no-signal dataset.
As a sanity check, run MASH with just the "identity" covariance. This assumes no sharing between tissues. The association signal should not have been boosted.

Implementation¶

Plans above are implemented in the following commands:

# Full data analysis
sos run analysis/Application_siQTL.ipynb flash
sos run analysis/Application_siQTL.ipynb mash
sos run analysis/Application_siQTL.ipynb mash_eq
sos run analysis/Application_siQTL.ipynb mash_uv
# First 1/2 data
sos run analysis/Application_siQTL.ipynb mash --start 1 --end 22
sos run analysis/Application_siQTL.ipynb mash_eq --start 1 --end 22
# Second 1/2 data
sos run analysis/Application_siQTL.ipynb mash --start 23 --end 44
sos run analysis/Application_siQTL.ipynb mash_eq --start 23 --end 44

Results¶

Results & intermediate files can be downloaded here.

It contains quite a few files from commands above (analysis below). Of interest are:

*.flash.model.pdf file, demonstrating the pattern of effect sharing using a sparse matrix factorization.
*.summary.rds and *.summary.pdf:
1. start_1_end_44 is analysis of full data.
2. start_1_end_22 and start_23_end_44 are halves of data.

Remark 1¶

As motivated above, it would be of interest to see if the estimated significant results are mostly consistent in sign in the two halves of data. If so, then MASH is doing something reasonable. I have generated the results but did not check and plot this particular investigation. To do so, simply load *mahsed_equal_eff.summary.rds file and look at significant_beta_est, and compare that across results.

Remark 2¶

Univariate adaptive shrinkage analysis (setting MASH covariance to identity) in fact shrunk every effect to zero, indicating no significant effects at all when conditions are not analyzed jointly. This is consistent with the univariate analysis results. See *mahsed_independent.summary.rds.

The rest of this notebook contain the code for the analysis -- a source to look things up just in case some quantities in those rds files are not self-explanary.

[global]
parameter: prefix=  "sbeQTL"
parameter: cwd = path('data/mash_out')
parameter: b_strong = path("data/Mashr_all.tissues_sbeQTL_all.genes.TopSNPbyZ_beta.txt.gz")
parameter: s_strong = path("data/Mashr_all.tissues_sbeQTL_all.genes.TopSNPbyZ_se.txt.gz")
parameter: b_random = path("data/Mashr_all.tissues_sbeQTL_all.genes.rand.200K_1_beta.txt.gz")
parameter: s_random = path("data/Mashr_all.tissues_sbeQTL_all.genes.rand.200K_1_se.txt.gz")
parameter: color_file = path("src/gtex_colors.txt")
# EZ or EE model; default to EZ
parameter: alpha = 1
# Subset tissue option: start index
parameter: start = 1
# Subset tissue option: end index
parameter: end = 44

# Load data
[mash_1, flash_1, mash_eq_1, mash_uv_1]
input: b_strong, s_strong, b_random, s_random, color_file
output: f"{cwd:a}/{prefix}.alpha_{alpha}_start_{start}_end_{end}.rds"
R: expand = '${ }'
    library(mashr)
    ## read & sample data
    beta.n=as.matrix(read.table(${b_random:ar}, head=T, as.is=T, row.names=1))
    se.n=as.matrix(read.table(${s_random:ar}, head=T, as.is=T, row.names=1))
    beta.s=as.matrix(read.table(${b_strong:ar}, head=T, as.is=T, row.names=1))
    se.s=as.matrix(read.table(${s_strong:ar}, head=T, as.is=T, row.names=1))
    set.seed(1)
    cols = sample(1:ncol(beta.n))[${start}:${end}]
    beta.n = beta.n[, sort(colnames(beta.n)[cols])]
    se.n = se.n[, sort(colnames(se.n)[cols])]
    beta.s = beta.s[, sort(colnames(beta.s)[cols])]
    se.s = se.s[, sort(colnames(se.s)[cols])]
    ## estimate null correlation
    Vhat = estimate_null_correlation(mash_set_data(beta.n, se.n, alpha=${alpha}), apply_lower_bound = FALSE)
    ## make mash files
    data.random = mash_set_data(beta.n, se.n, V=Vhat, alpha=1)
    data.strong = mash_set_data(beta.s, se.s, V=Vhat, alpha=1)
    ## get GTEx colors
    get_colors = function(color_file, names) {
        gtex.colors = read.table(color_file, head=T, sep = '\t', comment.char = '')
        gtex.colors$tissue_id = paste0('beta_',gtex.colors$tissue_id)
        gtex.colors$tissue_id[which(gtex.colors$tissue_id == 'beta_Brain_Spinal_cord_cervical_c-1')] = 'beta_Brain_Spinal_cord_cervical_c.1'
        colors.index = match(names, gtex.colors$tissue_id)
        colors = as.character(gtex.colors$color_hex[colors.index])
        colors[which(is.na(colors))] = '000000'
        colors = paste0('#',colors)
        return(colors)
    }
    colors = get_colors(${color_file:r}, colnames(beta.s))
    saveRDS(list(data.random=data.random,data.strong=data.strong, colors=colors), ${_output:r})

Using a sparse factor analysis called "FLASH", applied to top QTLs.

# Sparse factor analysis with adaptive shrinkage
[flash_2]
output: f"{_input:n}.flash.model.rds"
R: expand = '${ }', stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'
    library(flashr)
    library(mixSQP)
    library(mashr)
    
    my_init_fn <- function(Y, K = 1) {
      ret = flashr:::udv_si(Y, K)
      pos_sum = sum(ret$v[ret$v > 0])
      neg_sum = -sum(ret$v[ret$v < 0])
      if (neg_sum > pos_sum) {
        return(list(u = -ret$u, d = ret$d, v = -ret$v))
      } else
      return(ret)
    }

    flash_pipeline = function(data, ...) {
      ## current state-of-the art
      ## suggested by Jason Willwerscheid
      ## cf: discussion section of
      ## https://willwerscheid.github.io/MASHvFLASH/MASHvFLASHnn2.html
      ebnm_fn = "ebnm_ash"
      ebnm_param = list(l = list(mixcompdist = "normal",
                               optmethod = "mixSQP"),
                        f = list(mixcompdist = "+uniform",
                               optmethod = "mixSQP"))
      ##
      fl_g <- flashr:::flash_greedy_workhorse(data,
                    var_type = "constant",
                    ebnm_fn = ebnm_fn,
                    ebnm_param = ebnm_param,
                    init_fn = "my_init_fn",
                    stopping_rule = "factors",
                    tol = 1e-3,
                    verbose_output = "odF")
      fl_b <- flashr:::flash_backfit_workhorse(data,
                    f_init = fl_g,
                    var_type = "constant",
                    ebnm_fn = ebnm_fn,
                    ebnm_param = ebnm_param,
                    stopping_rule = "factors",
                    tol = 1e-3,
                    verbose_output = "odF")
      return(fl_b)
    }

    cov_flash = function(data, subset = NULL, non_canonical = FALSE, save_model = NULL) {
      if(is.null(subset)) subset = 1:mashr:::n_effects(data)
      b.center = apply(data$Bhat, 2, function(x) x - mean(x))
      ## Only keep factors with at least two values greater than 1 / sqrt(n)
      find_nonunique_effects <- function(fl) {
        thresh <- 1/sqrt(ncol(fl$fitted_values))
        vals_above_avg <- colSums(fl$ldf$f > thresh)
        nonuniq_effects <- which(vals_above_avg > 1)
        return(fl$ldf$f[, nonuniq_effects, drop = FALSE])
      }

      fmodel = flash_pipeline(b.center)
      if (non_canonical)
          flash_f = find_nonunique_effects(fmodel)
      else 
          flash_f = fmodel$ldf$f
      ## row.names(flash_f) = colnames(b)
      if (!is.null(save_model)) saveRDS(list(model=fmodel, factors=flash_f), save_model)
      U.flash = c(cov_from_factors(t(as.matrix(flash_f)), "FLASH"),
          list("tFLASH" = t(fmodel$fitted_values) %*% fmodel$fitted_values / nrow(fmodel$fitted_values)))
      return(U.flash)
    }
    ##
    attach(readRDS(${_input:r}))
    res = cov_flash(data.strong, non_canonical = FALSE, save_model = ${_output:r})
    saveRDS(res, "${_output:nn}.cov.rds")

# Make plot of factor analysis result
[flash_3]
output: f"{_input:n}.pdf"
R: expand = "${ }"
    model = readRDS(${_input:r})$model
    ldf = model$ldf
    pve = model$pve
    colors = readRDS("${_input:nnn}.rds")$colors
    options(repr.plot.width=12, repr.plot.height=5)
    pdf(${_output:r})
    for (i in 1:ncol(ldf$f)) {
        par(mar=c(10,1,1,1))
        barplot(ldf$f[,i], col=colors, names.arg=rownames(ldf$f), axes=F, main = paste("Factor", i, "PVE", round(pve[i], 3)), las = 2, cex.names = 0.6)
    }
    dev.off()

Run MASH on all canonical covariance¶

[mash_2]
output: f"{_input:n}.mashed.rds"
R: expand = '${ }'
    library(mashr)
    attach(readRDS(${_input:r}))
    ## get covariances
    U.c = cov_canonical(data.random)
    ## fit mash
    model = mash(data.random, Ulist = U.c, outputlevel = 1)
    result = mash(data.strong, g=get_fitted_g(model), fixg=TRUE)
    saveRDS(result, ${_output:r})

Run MASH on "equal effect" covariance¶

[mash_eq_2]
output: f"{_input:n}.mashed_equal_eff.rds"
R: expand = '${ }'
    library(mashr)
    attach(readRDS(${_input:r}))
    ## get covariances
    U.c = cov_canonical(data.random, cov_methods= 'equal_effects')
    ## fit mash
    model = mash(data.random, Ulist = U.c, outputlevel = 1)
    result = mash(data.strong, g=get_fitted_g(model), fixg=TRUE)
    saveRDS(result, ${_output:r})

Run MASH on identity covariance¶

Assuming no sharing.

[mash_uv_2]
output: f"{_input:n}.mashed_independent.rds"
R: expand = '${ }'
    library(mashr)
    attach(readRDS(${_input:r}))
    ## get covariances
    U.c = cov_canonical(data.random, cov_methods= 'identity')
    ## fit mash
    model = mash(data.random, Ulist = U.c, outputlevel = 1)
    result = mash(data.strong, g=get_fitted_g(model), fixg=TRUE)
    saveRDS(result, ${_output:r})

Results summary¶

[mash_3, mash_eq_3, mash_uv_3]
output: f"{_input:n}.summary.rds"
R: expand = '${ }'
    library(mashr)
    res = readRDS(${_input:r})
    sig_res = get_significant_results(res)
    sig_eff = res$result$PosteriorMean[sig_res,]
    pi_hat = get_estimated_pi(res)
    sharing = get_pairwise_sharing(res)
    colnames(sharing) = colnames(res$result$PosteriorMean)
    n_sig_conditions = get_n_significant_conditions(res)[sig_res]
    saveRDS(list(significant_beta_est = sig_eff, pi_hat = pi_hat, pairwise_sharing = sharing, num_sig_conditions = n_sig_conditions), ${_output:r})

[mash_4, mash_eq_4, mash_uv_4]
output: f"{_input:n}.pdf"
R: expand = '${ }'
    library(lattice)
    attach(readRDS(${_input:r}))
    pdf(${_output:r})
    par(mar=c(10,1,1,1))
    barplot(pi_hat, axes=F, las = 2, cex.names = 0.6, main = 'MASH estimated weights')
    par(mar=c(1,1,1,1))
    clrs = colorRampPalette(rev(c("#D73027","#FC8D59","#FEE090","#FFFFBF",
                               "#E0F3F8","#91BFDB","#4575B4")))(64)
    pairwise_sharing[which(is.na(pairwise_sharing))] = 0
    pairwise_sharing[upper.tri(pairwise_sharing)] = NA
    levelplot(pairwise_sharing[nrow(pairwise_sharing):1,ncol(pairwise_sharing):1],col.regions = clrs,xlab = "",ylab = "", colorkey = TRUE, main = 'Sharing by magnitude')
    if (length(num_sig_conditions) > 0)
        histogram(num_sig_conditions, main = 'Number of significant conditions for significant variables')
    dev.off()