Analysis of sex-interacting QTL data. See this ticket for a motivation.
We've been working on mash anlaysis of cis-eQTL where there are lots of signals. Recently it has been brought to our attention from the community that mash tends to produce a lot more signals than we can believe, eg in transQTL studies. The rough picture is that univariate analysis does not reveal many strong associations (<50 with fdr 0.05 from ~50 conditions combined), but after mash analysis we report >1000.
We have checked for model calibration using permutation. Nothing suspicious popped up. It is possible that when there is indeed a lot of sharing thus boosting the signal a lot, but I'm still suspicious. Certainly the group who analyzed transQTL using mash found way too many. But they report to me that most weights are on data driven matrices.
One possible problem is that the data-driven matrices are learned from the <50 signals. We should either learn from, say, top 1000, or just do not use data-driven matrices. Since I (probably) do not have the data I'm not going to check this.
Another potential issue is that the input data is not well calibrated, in terms of p-value. But I believe the colleagues who raised the problem should have noticed abnormality of p-values before reporting to us (still need to double-check with them).
For our ~50 conditions problem (GTEx V8) the sex specific QTL analysis,
Plans above are implemented in the following commands:
# Full data analysis
sos run analysis/Application_siQTL.ipynb flash
sos run analysis/Application_siQTL.ipynb mash
sos run analysis/Application_siQTL.ipynb mash_eq
sos run analysis/Application_siQTL.ipynb mash_uv
# First 1/2 data
sos run analysis/Application_siQTL.ipynb mash --start 1 --end 22
sos run analysis/Application_siQTL.ipynb mash_eq --start 1 --end 22
# Second 1/2 data
sos run analysis/Application_siQTL.ipynb mash --start 23 --end 44
sos run analysis/Application_siQTL.ipynb mash_eq --start 23 --end 44
Results & intermediate files can be downloaded here.
It contains quite a few files from commands above (analysis below). Of interest are:
*.flash.model.pdf
file, demonstrating the pattern of effect sharing using a sparse matrix factorization.*.summary.rds
and *.summary.pdf
:start_1_end_44
is analysis of full data. start_1_end_22
and start_23_end_44
are halves of data. As motivated above, it would be of interest to see if the estimated significant results are mostly consistent in sign in the two halves of data. If so, then MASH is doing something reasonable. I have generated the results but did not check and plot this particular investigation. To do so, simply load *mahsed_equal_eff.summary.rds
file and look at significant_beta_est
, and compare that across results.
Univariate adaptive shrinkage analysis (setting MASH covariance to identity) in fact shrunk every effect to zero, indicating no significant effects at all when conditions are not analyzed jointly. This is consistent with the univariate analysis results. See *mahsed_independent.summary.rds
.
The rest of this notebook contain the code for the analysis -- a source to look things up just in case some quantities in those rds
files are not self-explanary.
[global]
parameter: prefix= "sbeQTL"
parameter: cwd = path('data/mash_out')
parameter: b_strong = path("data/Mashr_all.tissues_sbeQTL_all.genes.TopSNPbyZ_beta.txt.gz")
parameter: s_strong = path("data/Mashr_all.tissues_sbeQTL_all.genes.TopSNPbyZ_se.txt.gz")
parameter: b_random = path("data/Mashr_all.tissues_sbeQTL_all.genes.rand.200K_1_beta.txt.gz")
parameter: s_random = path("data/Mashr_all.tissues_sbeQTL_all.genes.rand.200K_1_se.txt.gz")
parameter: color_file = path("src/gtex_colors.txt")
# EZ or EE model; default to EZ
parameter: alpha = 1
# Subset tissue option: start index
parameter: start = 1
# Subset tissue option: end index
parameter: end = 44
# Load data
[mash_1, flash_1, mash_eq_1, mash_uv_1]
input: b_strong, s_strong, b_random, s_random, color_file
output: f"{cwd:a}/{prefix}.alpha_{alpha}_start_{start}_end_{end}.rds"
R: expand = '${ }'
library(mashr)
## read & sample data
beta.n=as.matrix(read.table(${b_random:ar}, head=T, as.is=T, row.names=1))
se.n=as.matrix(read.table(${s_random:ar}, head=T, as.is=T, row.names=1))
beta.s=as.matrix(read.table(${b_strong:ar}, head=T, as.is=T, row.names=1))
se.s=as.matrix(read.table(${s_strong:ar}, head=T, as.is=T, row.names=1))
set.seed(1)
cols = sample(1:ncol(beta.n))[${start}:${end}]
beta.n = beta.n[, sort(colnames(beta.n)[cols])]
se.n = se.n[, sort(colnames(se.n)[cols])]
beta.s = beta.s[, sort(colnames(beta.s)[cols])]
se.s = se.s[, sort(colnames(se.s)[cols])]
## estimate null correlation
Vhat = estimate_null_correlation(mash_set_data(beta.n, se.n, alpha=${alpha}), apply_lower_bound = FALSE)
## make mash files
data.random = mash_set_data(beta.n, se.n, V=Vhat, alpha=1)
data.strong = mash_set_data(beta.s, se.s, V=Vhat, alpha=1)
## get GTEx colors
get_colors = function(color_file, names) {
gtex.colors = read.table(color_file, head=T, sep = '\t', comment.char = '')
gtex.colors$tissue_id = paste0('beta_',gtex.colors$tissue_id)
gtex.colors$tissue_id[which(gtex.colors$tissue_id == 'beta_Brain_Spinal_cord_cervical_c-1')] = 'beta_Brain_Spinal_cord_cervical_c.1'
colors.index = match(names, gtex.colors$tissue_id)
colors = as.character(gtex.colors$color_hex[colors.index])
colors[which(is.na(colors))] = '000000'
colors = paste0('#',colors)
return(colors)
}
colors = get_colors(${color_file:r}, colnames(beta.s))
saveRDS(list(data.random=data.random,data.strong=data.strong, colors=colors), ${_output:r})
Using a sparse factor analysis called "FLASH", applied to top QTLs.
# Sparse factor analysis with adaptive shrinkage
[flash_2]
output: f"{_input:n}.flash.model.rds"
R: expand = '${ }', stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'
library(flashr)
library(mixSQP)
library(mashr)
my_init_fn <- function(Y, K = 1) {
ret = flashr:::udv_si(Y, K)
pos_sum = sum(ret$v[ret$v > 0])
neg_sum = -sum(ret$v[ret$v < 0])
if (neg_sum > pos_sum) {
return(list(u = -ret$u, d = ret$d, v = -ret$v))
} else
return(ret)
}
flash_pipeline = function(data, ...) {
## current state-of-the art
## suggested by Jason Willwerscheid
## cf: discussion section of
## https://willwerscheid.github.io/MASHvFLASH/MASHvFLASHnn2.html
ebnm_fn = "ebnm_ash"
ebnm_param = list(l = list(mixcompdist = "normal",
optmethod = "mixSQP"),
f = list(mixcompdist = "+uniform",
optmethod = "mixSQP"))
##
fl_g <- flashr:::flash_greedy_workhorse(data,
var_type = "constant",
ebnm_fn = ebnm_fn,
ebnm_param = ebnm_param,
init_fn = "my_init_fn",
stopping_rule = "factors",
tol = 1e-3,
verbose_output = "odF")
fl_b <- flashr:::flash_backfit_workhorse(data,
f_init = fl_g,
var_type = "constant",
ebnm_fn = ebnm_fn,
ebnm_param = ebnm_param,
stopping_rule = "factors",
tol = 1e-3,
verbose_output = "odF")
return(fl_b)
}
cov_flash = function(data, subset = NULL, non_canonical = FALSE, save_model = NULL) {
if(is.null(subset)) subset = 1:mashr:::n_effects(data)
b.center = apply(data$Bhat, 2, function(x) x - mean(x))
## Only keep factors with at least two values greater than 1 / sqrt(n)
find_nonunique_effects <- function(fl) {
thresh <- 1/sqrt(ncol(fl$fitted_values))
vals_above_avg <- colSums(fl$ldf$f > thresh)
nonuniq_effects <- which(vals_above_avg > 1)
return(fl$ldf$f[, nonuniq_effects, drop = FALSE])
}
fmodel = flash_pipeline(b.center)
if (non_canonical)
flash_f = find_nonunique_effects(fmodel)
else
flash_f = fmodel$ldf$f
## row.names(flash_f) = colnames(b)
if (!is.null(save_model)) saveRDS(list(model=fmodel, factors=flash_f), save_model)
U.flash = c(cov_from_factors(t(as.matrix(flash_f)), "FLASH"),
list("tFLASH" = t(fmodel$fitted_values) %*% fmodel$fitted_values / nrow(fmodel$fitted_values)))
return(U.flash)
}
##
attach(readRDS(${_input:r}))
res = cov_flash(data.strong, non_canonical = FALSE, save_model = ${_output:r})
saveRDS(res, "${_output:nn}.cov.rds")
# Make plot of factor analysis result
[flash_3]
output: f"{_input:n}.pdf"
R: expand = "${ }"
model = readRDS(${_input:r})$model
ldf = model$ldf
pve = model$pve
colors = readRDS("${_input:nnn}.rds")$colors
options(repr.plot.width=12, repr.plot.height=5)
pdf(${_output:r})
for (i in 1:ncol(ldf$f)) {
par(mar=c(10,1,1,1))
barplot(ldf$f[,i], col=colors, names.arg=rownames(ldf$f), axes=F, main = paste("Factor", i, "PVE", round(pve[i], 3)), las = 2, cex.names = 0.6)
}
dev.off()
[mash_2]
output: f"{_input:n}.mashed.rds"
R: expand = '${ }'
library(mashr)
attach(readRDS(${_input:r}))
## get covariances
U.c = cov_canonical(data.random)
## fit mash
model = mash(data.random, Ulist = U.c, outputlevel = 1)
result = mash(data.strong, g=get_fitted_g(model), fixg=TRUE)
saveRDS(result, ${_output:r})
[mash_eq_2]
output: f"{_input:n}.mashed_equal_eff.rds"
R: expand = '${ }'
library(mashr)
attach(readRDS(${_input:r}))
## get covariances
U.c = cov_canonical(data.random, cov_methods= 'equal_effects')
## fit mash
model = mash(data.random, Ulist = U.c, outputlevel = 1)
result = mash(data.strong, g=get_fitted_g(model), fixg=TRUE)
saveRDS(result, ${_output:r})
Assuming no sharing.
[mash_uv_2]
output: f"{_input:n}.mashed_independent.rds"
R: expand = '${ }'
library(mashr)
attach(readRDS(${_input:r}))
## get covariances
U.c = cov_canonical(data.random, cov_methods= 'identity')
## fit mash
model = mash(data.random, Ulist = U.c, outputlevel = 1)
result = mash(data.strong, g=get_fitted_g(model), fixg=TRUE)
saveRDS(result, ${_output:r})
[mash_3, mash_eq_3, mash_uv_3]
output: f"{_input:n}.summary.rds"
R: expand = '${ }'
library(mashr)
res = readRDS(${_input:r})
sig_res = get_significant_results(res)
sig_eff = res$result$PosteriorMean[sig_res,]
pi_hat = get_estimated_pi(res)
sharing = get_pairwise_sharing(res)
colnames(sharing) = colnames(res$result$PosteriorMean)
n_sig_conditions = get_n_significant_conditions(res)[sig_res]
saveRDS(list(significant_beta_est = sig_eff, pi_hat = pi_hat, pairwise_sharing = sharing, num_sig_conditions = n_sig_conditions), ${_output:r})
[mash_4, mash_eq_4, mash_uv_4]
output: f"{_input:n}.pdf"
R: expand = '${ }'
library(lattice)
attach(readRDS(${_input:r}))
pdf(${_output:r})
par(mar=c(10,1,1,1))
barplot(pi_hat, axes=F, las = 2, cex.names = 0.6, main = 'MASH estimated weights')
par(mar=c(1,1,1,1))
clrs = colorRampPalette(rev(c("#D73027","#FC8D59","#FEE090","#FFFFBF",
"#E0F3F8","#91BFDB","#4575B4")))(64)
pairwise_sharing[which(is.na(pairwise_sharing))] = 0
pairwise_sharing[upper.tri(pairwise_sharing)] = NA
levelplot(pairwise_sharing[nrow(pairwise_sharing):1,ncol(pairwise_sharing):1],col.regions = clrs,xlab = "",ylab = "", colorkey = TRUE, main = 'Sharing by magnitude')
if (length(num_sig_conditions) > 0)
histogram(num_sig_conditions, main = 'Number of significant conditions for significant variables')
dev.off()
Exported from analysis/Application_siQTL.ipynb
committed by Gao Wang on Tue Feb 2 21:46:36 2021 revision 2, 18b5119