Formatting independent eQTL analysis results for MASH posterior calculations¶

I have received from GTEx fine-mapping group independent eQTL data. Here I convert it to MASH input format in order to compute posterior.

Data overview¶

There are two sets of results, from conditional analysis and from DAP-G. Data format is very similar to fastqtl output, except that there is an additional first column of tissue name.

Here for each data-set take the union of cross tissue results and obtain a list of gene-SNP pair IDs in YAML format, load it to R, and extract directly from the HDF5 database the relevant information. I do it for batches of $K$ genes, obtaining roughly $\frac{30000}{K}$ batches.

! sos run Independent_eQTL_Results.ipynb -h

usage: sos run Independent_eQTL_Results.ipynb
               [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  default

Global Workflow Options:
  --data data/independent_eQTL/DAPG_pip_gt_0.01-AllTissues.txt (as path)
                        A text file containing data-set names
  --db fastqtl_to_mash_output/FastQTLSumStats.h5 (as path)
                        HDF5 file of the summary statistics database
  --batch-size 5000 (as int)
                        Batch size

Sections
  default_1:            Generate batches of gene-SNPs
  default_2:            Run data conversion

Format conversion pipeline¶

Commands below processes the two fine-mapping results:

sos run analysis/Independent_eQTL_Results.ipynb default:1 \
    --data data/independent_eQTL/DAPG_pip_gt_0.01-AllTissues.txt
sos run analysis/Independent_eQTL_Results.ipynb default:1 \
    --data data/independent_eQTL/ConditionalAnalysis_AllTissues.txt

(and run on cluster:)

sos run analysis/Independent_eQTL_Results.ipynb default:2 \
    --data data/independent_eQTL/DAPG_pip_gt_0.01-AllTissues.txt \
    -c data/fe961153.localhost.yml
sos run analysis/Independent_eQTL_Results.ipynb default:2 \
    --data data/independent_eQTL/ConditionalAnalysis_AllTissues.txt \
    -c data/fe961153.localhost.yml

[global]
# A text file containing data-set names
parameter: data = path("data/independent_eQTL/DAPG_pip_gt_0.01-AllTissues.txt")
# HDF5 file of the summary statistics database
parameter: db = path("fastqtl_to_mash_output/FastQTLSumStats.h5")
# Batch size
parameter: batch_size = 5000

Workflow below extracts batches of YAML files of the format:

gene_name_1: [snp1, snp2, ...]
gene_name_2: [snp1, snp2, ...]

They will later be used to extract from the summary statistics database the gene-SNP pairs of interest.

# Generate batches of gene-SNPs
[default_1]
fail_if(not data.is_file(), msg = 'Need data file!')
input: data
output: dynamic(f"{data:n}/{data:bn}.batch_*.yaml")
bash: expand = True
    mkdir -p {data:n}
python: expand = '${ }', stdout = f"{data:n}/{data:bn}.stdout"
    def write_res(res, batch):
        import yaml
        fn = f"${data:n}/${data:bn}.batch_{batch}.yaml"
        with open(fn, 'w') as outfile:
            yaml.dump(res, outfile)

    import pandas as pd
    data = pd.read_csv(${_input:r}, sep='\t', header=0)
    #data.sort_values(by=['gene_id', 'variant_id'], inplace=True)
    res = dict()
    idx = 1
    batch = 1
    print("Start ...")
    for item in set(data['gene_id']):
        res[item.rsplit('.',1)[0]] = sorted(set(data.loc[data['gene_id'] == item]['variant_id'].tolist()))
        if idx % ${batch_size} == 0:
            print(f'batch {batch}')
            write_res(res, batch)
            batch += 1
            res = dict()
        idx += 1
    if len(res):
        write_res(res, batch)
    print("Done!")

Workflow below extracts gene-SNP pairs of interest.

FIXME: not all gene-SNPs in the YAML file exist in the summary statistics database!. Recall that the YAML file contains results from DAP-G / Conditional Regression analysis but I use their ID to extract from the original fastqtl results.

Possible reasons of inconsistency:

The two data-sets analyzed slightly different gene-SNP combinations
A bug in my fastqtl to summary database conversion code!

I saved these inconsistencies in *.batch_*.stdout. It prints which gene has fewer SNPs in the summary statistics DB. Most of them are just 1 SNP short. There are not many, anyways.

# Run data conversion
[default_2]
depends: R_library('rhdf5'), R_library('yaml')
input: f"{data:n}/{data:bn}.batch_*.yaml", group_by = 1, concurrent = True
output: f'{_input:n}.rds'
task: trunk_workers = 1, queue = 'midway2', walltime = '2h', trunk_size = 1, mem = '4G', cores = 1, tags = f'{_output:bn}'
R: expand = "${ }", stdout = f'{_output:n}.stdout'
    ConvertP2Z <- function(pval, beta) {
      z <- abs(qnorm(pval / 2))
      z[which(beta < 0)] <- -1 * z[which(beta < 0)]
      return(z)
    }

    GetSS <- function(table, db) {
      dat <- rhdf5::h5read(db, table)
      dat$"z" <- ConvertP2Z(dat$"pval", dat$"beta")
      for (name in c("beta", "se", "pval", "z")) {
        dat[[name]] <- as.matrix(t(dat[[name]]))
        colnames(dat[[name]]) <- dat$colnames
        rownames(dat[[name]]) <- dat$rownames
      }
      dat$colnames <- dat$rownames <- NULL
      ## fastqtl tend to produce nan for betahat zero
      ## here I set it to zero as well
      dat[['se']][which(is.nan(dat[['se']]))] = 0
      return(dat)
    }
    ##
    meta = yaml::read_yaml(${_input:r})
    res = list()
    for (gene in names(meta)) {
      snps = gsub('^chr|_b38$', '', meta[[gene]])
      dat = GetSS(gene, ${db:ar})
      for (p in names(dat)) {
        common_snps = intersect(snps, rownames(dat[[p]]))
        snp_diff = length(snps) - length(common_snps)
        if (snp_diff > 0 && p == 'beta')
          print(paste(gene, 'is', snp_diff, 'SNP short!'))
        if (length(common_snps) == 0) {
          if (p == 'beta')
            print(paste(gene, 'is empty!'))
          next
        }
        dat[[p]] = dat[[p]][common_snps,,drop=F]
        rownames(dat[[p]]) = paste0(gene, '_', rownames(dat[[p]]))
      }
      if (is.null(res$Bhat))
        res$Bhat = dat[['beta']]
      else
        res$Bhat = rbind(res$Bhat, dat[['beta']])
      if (is.null(res$Shat))
        res$Shat = dat[['se']]
      else
        res$Shat = rbind(res$Shat, dat[['se']])
    }
    saveRDS(res, ${_output:r})