I have received from GTEx fine-mapping group independent eQTL data. Here I convert it to MASH input format in order to compute posterior.
There are two sets of results, from conditional analysis and from DAP-G. Data format is very similar to fastqtl
output, except that there is an additional first column of tissue name.
Here for each data-set take the union of cross tissue results and obtain a list of gene-SNP pair IDs in YAML format, load it to R, and extract directly from the HDF5 database the relevant information. I do it for batches of $K$ genes, obtaining roughly $\frac{30000}{K}$ batches.
! sos run Independent_eQTL_Results.ipynb -h
Commands below processes the two fine-mapping results:
sos run analysis/Independent_eQTL_Results.ipynb default:1 \
--data data/independent_eQTL/DAPG_pip_gt_0.01-AllTissues.txt
sos run analysis/Independent_eQTL_Results.ipynb default:1 \
--data data/independent_eQTL/ConditionalAnalysis_AllTissues.txt
(and run on cluster:)
sos run analysis/Independent_eQTL_Results.ipynb default:2 \
--data data/independent_eQTL/DAPG_pip_gt_0.01-AllTissues.txt \
-c data/fe961153.localhost.yml
sos run analysis/Independent_eQTL_Results.ipynb default:2 \
--data data/independent_eQTL/ConditionalAnalysis_AllTissues.txt \
-c data/fe961153.localhost.yml
[global]
# A text file containing data-set names
parameter: data = path("data/independent_eQTL/DAPG_pip_gt_0.01-AllTissues.txt")
# HDF5 file of the summary statistics database
parameter: db = path("fastqtl_to_mash_output/FastQTLSumStats.h5")
# Batch size
parameter: batch_size = 5000
Workflow below extracts batches of YAML files of the format:
gene_name_1: [snp1, snp2, ...]
gene_name_2: [snp1, snp2, ...]
They will later be used to extract from the summary statistics database the gene-SNP pairs of interest.
# Generate batches of gene-SNPs
[default_1]
fail_if(not data.is_file(), msg = 'Need data file!')
input: data
output: dynamic(f"{data:n}/{data:bn}.batch_*.yaml")
bash: expand = True
mkdir -p {data:n}
python: expand = '${ }', stdout = f"{data:n}/{data:bn}.stdout"
def write_res(res, batch):
import yaml
fn = f"${data:n}/${data:bn}.batch_{batch}.yaml"
with open(fn, 'w') as outfile:
yaml.dump(res, outfile)
import pandas as pd
data = pd.read_csv(${_input:r}, sep='\t', header=0)
#data.sort_values(by=['gene_id', 'variant_id'], inplace=True)
res = dict()
idx = 1
batch = 1
print("Start ...")
for item in set(data['gene_id']):
res[item.rsplit('.',1)[0]] = sorted(set(data.loc[data['gene_id'] == item]['variant_id'].tolist()))
if idx % ${batch_size} == 0:
print(f'batch {batch}')
write_res(res, batch)
batch += 1
res = dict()
idx += 1
if len(res):
write_res(res, batch)
print("Done!")
Workflow below extracts gene-SNP pairs of interest.
FIXME: not all gene-SNPs in the YAML file exist in the summary statistics database!. Recall that the YAML file contains results from DAP-G / Conditional Regression analysis but I use their ID to extract from the original fastqtl
results.
Possible reasons of inconsistency:
fastqtl
to summary database conversion code!I saved these inconsistencies in *.batch_*.stdout
. It prints which gene has fewer SNPs in the summary statistics DB. Most of them are just 1 SNP short. There are not many, anyways.
# Run data conversion
[default_2]
depends: R_library('rhdf5'), R_library('yaml')
input: f"{data:n}/{data:bn}.batch_*.yaml", group_by = 1, concurrent = True
output: f'{_input:n}.rds'
task: trunk_workers = 1, queue = 'midway2', walltime = '2h', trunk_size = 1, mem = '4G', cores = 1, tags = f'{_output:bn}'
R: expand = "${ }", stdout = f'{_output:n}.stdout'
ConvertP2Z <- function(pval, beta) {
z <- abs(qnorm(pval / 2))
z[which(beta < 0)] <- -1 * z[which(beta < 0)]
return(z)
}
GetSS <- function(table, db) {
dat <- rhdf5::h5read(db, table)
dat$"z" <- ConvertP2Z(dat$"pval", dat$"beta")
for (name in c("beta", "se", "pval", "z")) {
dat[[name]] <- as.matrix(t(dat[[name]]))
colnames(dat[[name]]) <- dat$colnames
rownames(dat[[name]]) <- dat$rownames
}
dat$colnames <- dat$rownames <- NULL
## fastqtl tend to produce nan for betahat zero
## here I set it to zero as well
dat[['se']][which(is.nan(dat[['se']]))] = 0
return(dat)
}
##
meta = yaml::read_yaml(${_input:r})
res = list()
for (gene in names(meta)) {
snps = gsub('^chr|_b38$', '', meta[[gene]])
dat = GetSS(gene, ${db:ar})
for (p in names(dat)) {
common_snps = intersect(snps, rownames(dat[[p]]))
snp_diff = length(snps) - length(common_snps)
if (snp_diff > 0 && p == 'beta')
print(paste(gene, 'is', snp_diff, 'SNP short!'))
if (length(common_snps) == 0) {
if (p == 'beta')
print(paste(gene, 'is empty!'))
next
}
dat[[p]] = dat[[p]][common_snps,,drop=F]
rownames(dat[[p]]) = paste0(gene, '_', rownames(dat[[p]]))
}
if (is.null(res$Bhat))
res$Bhat = dat[['beta']]
else
res$Bhat = rbind(res$Bhat, dat[['beta']])
if (is.null(res$Shat))
res$Shat = dat[['se']]
else
res$Shat = rbind(res$Shat, dat[['se']])
}
saveRDS(res, ${_output:r})
Exported from analysis/Independent_eQTL_Results.ipynb
committed by Gao Wang on Tue Feb 2 19:11:23 2021 revision 1, c5fe213