SuSiE manuscript

Individual-level genotype-expression data preprocessing for association analysis

This pipeline aims to extract individual-level genotype-expression data in preparation for cis-eQTL fine-mapping. It was written by Jiarun Chen (Tsinghua U.) advised by Gao Wang.

Aim

We want to organize data in units of genes. For each gene, we would like to have an RDS file that saves the genotype (sample by variants), expression (sample by condition, ie, tissues in GTEx context), and covariates data (sample by covariates). Additionally to facilicate downstream analysis, for each condition of normalized gene expression we regress out covariates and save residuals as a residual expression matrix.

Implementation

Each RDS file contains information to perform association analysis of one gene,

  1. covariates (Z): This is independent of genes, so this is the same among genes and can be directly copied from the original covariates data, for each condition (tissues in GTEx).
  2. expression (y, and residual, y_res): Each condition saves expression data for all genes in one file. We first generate a gene:line_number meta-database -- for each gene, the line_number is the line numbers of the gene in a expression data files. Once we have the meta-data, for each gene we read the corresponding lines from each condition and combine them as the expression matrix. This will be a lot faster than, e.g., to use grep for genes from each file; and is easier to write than say using database type of languages. To implement,
    1. extract the column of ensembl_gene_id in the 49 expression files, using pandas.
    2. assign gene:line_number for each gene in each tissue based on the extracted column and their row index, and write them to text files as meta information.
    3. for each gene, extract the row of expression of the gene using meta info from above, and combine it with sample names.
  3. genotype (X): Given TSS data of all genes, we extract from VCF file genotype within a predefined radius around the TSS (say, 1Mb).

Data format documentation

Input

Genotype data is one VCF file.

  • genotype: .vcf.gz
    #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  GTEX-1117F      GTEX-111CU      GTEX-111FC      GTEX-111VG      GTEX-111YS      GTEX-1122O
    chr1    13526   chr1_13526_C_T_b38      C       T       .       PASS    AN=1676;AF=0.000596659;AC=1     GT      0|0     0|0     0|0     0|0     0|0     0|0

Expression data is per file per condition,

  • expression: .bed.gz
      #chr    start   end     gene_id GTEX-1117F      GTEX-111CU      GTEX-111FC      GTEX-111VG      GTEX-111YS      GTEX-1122O      GTEX-1128S
      chr1    29552   29553   ENSG00000227232.5       1.3135329173730264      -0.9007944960568992     -0.29268956046586164    -0.7324113622418431     -0.27475411245874887    -0.6990255198601908     0.18188866123299216

Covariate data is per file per condition,

  • covariates: .txt
      ID      GTEX-1117F      GTEX-111CU      GTEX-111FC      GTEX-111VG      GTEX-111YS
      PC1     -0.0867  0.0107  0.0099  0.0144  0.0154
      PC2     -0.0132 -0.0026 -0.0050 -0.0081 -0.0093

Output

Every gene is an RDS file containing information outlined in previous section. Below is a processed example:

In [4]:
dat <- readRDS(file = '/scratch/midway2/chj1ar/GTEx_Analysis_v8_eQTL_expression_genewise/output/ENSG00000284484.1.Multi_Tissues.RDS')
str(dat)
List of 4
 $ X    : chr [1:838, 1:40947] "0" "0" "0" "0" ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:838] "GTEX-1117F" "GTEX-111CU" "GTEX-111FC" "GTEX-111VG" ...
  .. ..$ : chr [1:40947] "chr16_74677011_G_C_b38" "chr16_74677177_G_A_b38" "chr16_74677272_C_T_b38" "chr16_74677278_C_G_b38" ...
 $ y    :List of 12
  ..$ Brain_Cerebellar_Hemisphere          : chr [1:175, 1] "0.4099833" "0.7860842" "-0.8056329" "0.2007308" ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:175] "GTEX-11DYG" "GTEX-11DZ1" "GTEX-11EI6" "GTEX-11EMC" ...
  .. .. ..$ : NULL
  ..$ Brain_Cerebellum                     : chr [1:209, 1] "-0.08365173" "-1.0467" "0.8247319" "2.18935" ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:209] "GTEX-111FC" "GTEX-1128S" "GTEX-117XS" "GTEX-1192X" ...
  .. .. ..$ : NULL
  ..$ Brain_Hypothalamus                   : chr [1:170, 1] "1.003148" "-0.1992013" "0.4960161" "-1.003148" ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:170] "GTEX-1192X" "GTEX-11DYG" "GTEX-11DZ1" "GTEX-11EI6" ...
  .. .. ..$ : NULL
  ..$ Brain_Nucleus_accumbens_basal_ganglia: chr [1:202, 1] "0.294373" "-0.7665489" "1.413069" "-0.5805017" ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:202] "GTEX-1192X" "GTEX-11DYG" "GTEX-11DZ1" "GTEX-11GSP" ...
  .. .. ..$ : NULL
  ..$ Nerve_Tibial                         : chr [1:532, 1] "-1.306538" "0.05881974" "0.9889212" "-0.1107433" ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:532] "GTEX-111CU" "GTEX-111FC" "GTEX-111VG" "GTEX-111YS" ...
  .. .. ..$ : NULL
  ..$ Ovary                                : chr [1:167, 1] "-0.8331469" "1.179761" "1.094332" "1.731664" ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:167] "GTEX-11DXX" "GTEX-11EM3" "GTEX-11EMC" "GTEX-11GSP" ...
  .. .. ..$ : NULL
  ..$ Pituitary                            : chr [1:237, 1] "-1.465234" "0.8813078" "0.5413951" "-0.2230078" ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:237] "GTEX-1128S" "GTEX-113JC" "GTEX-117XS" "GTEX-117YW" ...
  .. .. ..$ : NULL
  ..$ Spleen                               : chr [1:227, 1] "-0.3013365" "-1.098629" "-0.2216821" "0.504322" ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:227] "GTEX-111CU" "GTEX-111FC" "GTEX-1122O" "GTEX-117YX" ...
  .. .. ..$ : NULL
  ..$ Testis                               : chr [1:322, 1] "0.7520643" "-0.5594548" "-1.401747" "0.4535472" ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:322] "GTEX-111CU" "GTEX-111FC" "GTEX-111VG" "GTEX-111YS" ...
  .. .. ..$ : NULL
  ..$ Thyroid                              : chr [1:574, 1] "1.276612" "0.9802626" "-1.657634" "0.4339182" ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:574] "GTEX-111CU" "GTEX-111FC" "GTEX-111VG" "GTEX-111YS" ...
  .. .. ..$ : NULL
  ..$ Uterus                               : chr [1:129, 1] "-0.7618428" "1.053073" "-1.159742" "1.426077" ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:129] "GTEX-1117F" "GTEX-113JC" "GTEX-11DXX" "GTEX-11EM3" ...
  .. .. ..$ : NULL
  ..$ Vagina                               : chr [1:141, 1] "-1.250087" "0.4371924" "-0.5988157" "0.7542415" ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:141] "GTEX-1117F" "GTEX-113JC" "GTEX-11DXX" "GTEX-11EM3" ...
  .. .. ..$ : NULL
 $ Z    :List of 12
  ..$ Brain_Cerebellar_Hemisphere          : num [1:175, 1:38] -0.0941 0.0148 0.0141 0.0141 0.0147 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:175] "GTEX-11DYG" "GTEX-11DZ1" "GTEX-11EI6" "GTEX-11EMC" ...
  .. .. ..$ : chr [1:38] "PC1" "PC2" "PC3" "PC4" ...
  ..$ Brain_Cerebellum                     : num [1:209, 1:38] 0.0099 0.0145 0.0139 0.0138 -0.0941 0.0148 0.0141 0.0141 0.0142 0.0147 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:209] "GTEX-111FC" "GTEX-1128S" "GTEX-117XS" "GTEX-1192X" ...
  .. .. ..$ : chr [1:38] "PC1" "PC2" "PC3" "PC4" ...
  ..$ Brain_Hypothalamus                   : num [1:170, 1:38] 0.0138 -0.0941 0.0148 0.0141 0.0141 0.0147 0.0146 0.0145 0.0161 -0.0958 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:170] "GTEX-1192X" "GTEX-11DYG" "GTEX-11DZ1" "GTEX-11EI6" ...
  .. .. ..$ : chr [1:38] "PC1" "PC2" "PC3" "PC4" ...
  ..$ Brain_Nucleus_accumbens_basal_ganglia: num [1:202, 1:38] 0.0138 -0.0941 0.0148 0.0147 0.0139 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:202] "GTEX-1192X" "GTEX-11DYG" "GTEX-11DZ1" "GTEX-11GSP" ...
  .. .. ..$ : chr [1:38] "PC1" "PC2" "PC3" "PC4" ...
  ..$ Nerve_Tibial                         : num [1:532, 1:68] 0.0107 0.0099 0.0144 0.0154 -0.0728 -0.024 0.0138 0.0141 -0.0898 -0.0941 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:532] "GTEX-111CU" "GTEX-111FC" "GTEX-111VG" "GTEX-111YS" ...
  .. .. ..$ : chr [1:68] "PC1" "PC2" "PC3" "PC4" ...
  ..$ Ovary                                : num [1:167, 1:38] 0.0141 0.0145 0.0141 0.0147 0.014 0.0146 0.014 0.015 0.0154 0.0144 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:167] "GTEX-11DXX" "GTEX-11EM3" "GTEX-11EMC" "GTEX-11GSP" ...
  .. .. ..$ : chr [1:38] "PC1" "PC2" "PC3" "PC4" ...
  ..$ Pituitary                            : num [1:237, 1:38] 0.0145 0.0106 0.0139 0.0109 0.0134 0.0138 -0.0941 0.0141 0.0141 0.0091 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:237] "GTEX-1128S" "GTEX-113JC" "GTEX-117XS" "GTEX-117YW" ...
  .. .. ..$ : chr [1:38] "PC1" "PC2" "PC3" "PC4" ...
  ..$ Spleen                               : num [1:227, 1:38] 0.0107 0.0099 0.0139 -0.024 0.0141 0.0135 0.0148 0.0139 -0.0998 0.0146 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:227] "GTEX-111CU" "GTEX-111FC" "GTEX-1122O" "GTEX-117YX" ...
  .. .. ..$ : chr [1:38] "PC1" "PC2" "PC3" "PC4" ...
  ..$ Testis                               : num [1:322, 1:53] 0.0107 0.0099 0.0144 0.0154 0.0139 0.0109 -0.024 -0.0898 0.0141 0.0139 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:322] "GTEX-111CU" "GTEX-111FC" "GTEX-111VG" "GTEX-111YS" ...
  .. .. ..$ : chr [1:53] "PC1" "PC2" "PC3" "PC4" ...
  ..$ Thyroid                              : num [1:574, 1:68] 0.0107 0.0099 0.0144 0.0154 0.0139 0.0145 0.0106 0.0139 0.0109 -0.024 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:574] "GTEX-111CU" "GTEX-111FC" "GTEX-111VG" "GTEX-111YS" ...
  .. .. ..$ : chr [1:68] "PC1" "PC2" "PC3" "PC4" ...
  ..$ Uterus                               : num [1:129, 1:23] -0.0867 0.0106 0.0141 0.0145 0.0141 0.0147 0.014 0.0146 0.0154 0.0144 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:129] "GTEX-1117F" "GTEX-113JC" "GTEX-11DXX" "GTEX-11EM3" ...
  .. .. ..$ : chr [1:23] "PC1" "PC2" "PC3" "PC4" ...
  ..$ Vagina                               : num [1:141, 1:23] -0.0867 0.0106 0.0141 0.0145 0.0141 0.0147 0.014 0.0146 0.0161 0.014 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:141] "GTEX-1117F" "GTEX-113JC" "GTEX-11DXX" "GTEX-11EM3" ...
  .. .. ..$ : chr [1:23] "PC1" "PC2" "PC3" "PC4" ...
 $ y_res:List of 12
  ..$ Brain_Cerebellar_Hemisphere          : num [1:175, 1] 0.4994 0.258 0.0121 0.1096 0.2218 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:175] "GTEX-11DYG" "GTEX-11DZ1" "GTEX-11EI6" "GTEX-11EMC" ...
  .. .. ..$ : NULL
  ..$ Brain_Cerebellum                     : num [1:209, 1] 0.25761 -0.40058 0.26031 0.23858 0.00136 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:209] "GTEX-111FC" "GTEX-1128S" "GTEX-117XS" "GTEX-1192X" ...
  .. .. ..$ : NULL
  ..$ Brain_Hypothalamus                   : num [1:170, 1] -0.3326 -0.0139 0.1878 0.0302 -0.1408 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:170] "GTEX-1192X" "GTEX-11DYG" "GTEX-11DZ1" "GTEX-11EI6" ...
  .. .. ..$ : NULL
  ..$ Brain_Nucleus_accumbens_basal_ganglia: num [1:202, 1] -0.1732 -0.5435 0.366 -0.135 -0.0661 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:202] "GTEX-1192X" "GTEX-11DYG" "GTEX-11DZ1" "GTEX-11GSP" ...
  .. .. ..$ : NULL
  ..$ Nerve_Tibial                         : num [1:532, 1] -0.0746 0.0679 -0.6145 -0.029 -0.2999 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:532] "GTEX-111CU" "GTEX-111FC" "GTEX-111VG" "GTEX-111YS" ...
  .. .. ..$ : NULL
  ..$ Ovary                                : num [1:167, 1] -0.1464 0.0702 0.4032 0.6144 0.5007 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:167] "GTEX-11DXX" "GTEX-11EM3" "GTEX-11EMC" "GTEX-11GSP" ...
  .. .. ..$ : NULL
  ..$ Pituitary                            : num [1:237, 1] -0.136526 -0.266844 -0.583977 0.000303 0.163791 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:237] "GTEX-1128S" "GTEX-113JC" "GTEX-117XS" "GTEX-117YW" ...
  .. .. ..$ : NULL
  ..$ Spleen                               : num [1:227, 1] -0.2093 0.0971 -0.081 -0.1846 -0.6929 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:227] "GTEX-111CU" "GTEX-111FC" "GTEX-1122O" "GTEX-117YX" ...
  .. .. ..$ : NULL
  ..$ Testis                               : num [1:322, 1] -0.17 0.125 0.128 0.202 0.277 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:322] "GTEX-111CU" "GTEX-111FC" "GTEX-111VG" "GTEX-111YS" ...
  .. .. ..$ : NULL
  ..$ Thyroid                              : num [1:574, 1] -0.0422 0.4231 -0.771 -0.2314 -0.0944 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:574] "GTEX-111CU" "GTEX-111FC" "GTEX-111VG" "GTEX-111YS" ...
  .. .. ..$ : NULL
  ..$ Uterus                               : num [1:129, 1] -0.71121 -0.00343 0.31425 0.50844 -0.63224 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:129] "GTEX-1117F" "GTEX-113JC" "GTEX-11DXX" "GTEX-11EM3" ...
  .. .. ..$ : NULL
  ..$ Vagina                               : num [1:141, 1] -0.9296 0.2999 0.3062 0.0652 -0.014 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:141] "GTEX-1117F" "GTEX-113JC" "GTEX-11DXX" "GTEX-11EM3" ...
  .. .. ..$ : NULL
In [1]:
sos run 20191029_GTEx_V8_preprocessing.ipynb -h
usage: sos run 20191029_GTEx_V8_preprocessing.ipynb
               [workflow_name | -t targets] [options] [workflow_options]
  workflow_name:        Single or combined workflows defined in this script
  targets:              One or more targets to generate
  options:              Single-hyphen sos parameters (see "sos run -h" for details)
  workflow_options:     Double-hyphen workflow-specific parameters

Workflows:
  get_gene_meta
  preprocess
  get_gene_meta_biomaRt
  extract
  dap

Global Workflow Options:
  --cwd output (as path)
  --expression-data  paths(glob('/project2/compbio/GTEx_dbGaP/GTEx_Analysis_2017-06-05_v8/eqtl/GTEx_Analysis_v8_eQTL_expression_matrices/*.gz'))

  --genotype-data /project2/compbio/GTEx_dbGaP/GTEx_Analysis_2017-06-05_v8/genotypes/WGS/variant_calls/GTEx_Analysis_2017-06-05_v8_WholeGenomeSeq_838Indiv_Analysis_Freeze.SHAPEIT2_phased.vcf.gz (as path)
  --covariates-data  paths(glob('/project2/compbio/GTEx_dbGaP/GTEx_Analysis_2017-06-05_v8/eqtl/GTEx_Analysis_v8_eQTL_covariates/*.txt'))

  --gene-id-file  path(f"{cwd:a}/ensembl_gene_id.txt")

  --gene-tss-file  path(f"{cwd:a}/ensembl_gene_tss.txt")

  --analysis-ready-dir  cwd


Sections
  get_gene_meta_1:      get gene-line mappping
  get_gene_meta_2:      TSS for genes using provided TSS information
  get_gene_meta_3:      get all gene names
  preprocess:
  get_gene_meta_biomaRt: # TSS for genes by biomaRt
  extract:
  dap:
    Workflow Options:
      --args ''
                        Extra arguments to pass to DAP

Usage

You can edit and change the following bash variable. First edit and run the following bash variable.

work_dir=~/GTEx_Analysis_v8_eQTL_expression_genewise

Then run as follows:

cd $work_dir
sos run 20191029_GTEx_V8_preprocessing.ipynb preprocess
sos run 20191029_GTEx_V8_preprocessing.ipynb extract --analysis-ready-dir /project2/compbio/GTEx_eQTL/cis_eqtl_analysis_ready
sos run 20191029_GTEx_V8_preprocessing.ipynb dap --analysis-ready-dir /project2/compbio/GTEx_eQTL/cis_eqtl_analysis_ready

To run the code on UChicago RCC midway, for example,

sos run 20191029_GTEx_V8_preprocessing.ipynb extract --gene-id-file output/ensembl_gene_id.txt \
                                    --analysis-ready-dir /project2/compbio/GTEx_eQTL/cis_eqtl_analysis_ready \
                                    -c midway2.yml -q midway2

A minimal working example

If you want to only extract a self-defined list, eg, first 2 genes from the gene_id_file,

work_dir=~/GTEx_Analysis_v8_eQTL_expression_genewise
cd $work_dir
sos run 20191029_GTEx_V8_preprocessing.ipynb preprocess
head -2 output/ensembl_gene_id.txt > output/test.txt
sos run 20191029_GTEx_V8_preprocessing.ipynb extract --gene-id-file output/test.txt --analysis-ready-dir output/
In [2]:
[global]
import os
from glob import glob
parameter: cwd = path('./output')
parameter: expression_data = paths(glob('/project2/compbio/GTEx_dbGaP/GTEx_Analysis_2017-06-05_v8/eqtl/GTEx_Analysis_v8_eQTL_expression_matrices/*.gz'))
parameter: genotype_data = path('/project2/compbio/GTEx_dbGaP/GTEx_Analysis_2017-06-05_v8/genotypes/WGS/variant_calls/GTEx_Analysis_2017-06-05_v8_WholeGenomeSeq_838Indiv_Analysis_Freeze.SHAPEIT2_phased.vcf.gz')
parameter: covariates_data = paths(glob('/project2/compbio/GTEx_dbGaP/GTEx_Analysis_2017-06-05_v8/eqtl/GTEx_Analysis_v8_eQTL_covariates/*.txt'))
parameter: gene_id_file = path(f"{cwd:a}/ensembl_gene_id.txt")
parameter: gene_tss_file = path(f"{cwd:a}/ensembl_gene_tss.txt")
parameter: analysis_ready_dir = cwd

Preprocessing

In [ ]:
# get gene-line mapping
[get_gene_meta_1]
input: expression_data, group_by = 1
output: f"{cwd:a}/{_input:bnn}.gene_meta.json", f"{cwd:a}/{_input:bnn}.gene_meta.gz"
python3: expand = '${ }'
    import pandas as pd
    genes = pd.read_csv(${_input:r}, sep='\t', compression='gzip', header=0, skip_blank_lines=False, usecols = [0,1,2,3])
    gene_linenum = dict([(x,y+1) for y, x in enumerate(genes['gene_id'].tolist())])
    import json
    with open(${_output[0]:r}, 'w') as outfile:
        json.dump({${_input:r}:gene_linenum}, outfile)
    # chr, start, end, gene name
    genes.to_csv(${_output[1]:r}, sep = '\t', header=False, index=False)
In [ ]:
# TSS for genes using provided TSS information
[get_gene_meta_2]
input: group_by = 'all'
output: gene_tss_file
bash: expand = '${ }', workdir = cwd
    zcat ${paths(_input[1::2])} | sort -u > ${_output}
In [ ]:
# get all gene names
[get_gene_meta_3]
output: gene_id_file
bash: expand = '${ }', workdir = cwd
    awk '{print $4}' ${_input} | sort -u > ${_output}
In [2]:
[preprocess]
sos_run('get_gene_meta')
In [1]:
%cd ~/GTEx_Analysis_v8_eQTL_expression_genewise/output
/home/gaow/GTEx_Analysis_v8_eQTL_expression_genewise/output
In [2]:
wc -l ensembl_gene_tss.txt
39832 ensembl_gene_tss.txt
In [3]:
wc -l ensembl_gene_id.txt
39832 ensembl_gene_id.txt

The reason why we don't use biomaRt to get TSS for genes

The following workflow is not used for the current project because biomaRt fails to find some ensembl_gene_id. The loss is considerable, approximately from 40K to 20K. This is mostly due to gene name version inconsistencies in GTEx and in current Ensembl. For example the gene ENSG00000284523.1 is now ENSG00000284523.2 in Ensembl; ENSG00000284552.1 can no longer be found.

In [ ]:
# # TSS for genes by biomaRt
[get_gene_meta_biomaRt]
depends: R_library("biomaRt"), R_library("data.table")
input: gene_id_file
output: gene_tss_file
R: expand = '${ }'
    ensembl_gene_id <- data.table::fread(file = ${_input:r}, sep = "\n", quote = "", header = FALSE)
    mart <- biomaRt::useDataset("hsapiens_gene_ensembl", biomaRt::useMart("ensembl"))
    gene_TSS <- biomaRt::getBM(attributes = c("chromosome_name", "transcript_start", "transcript_end", "ensembl_gene_id", "ensembl_gene_id_version"), filters = "ensembl_gene_id_version", values = ensembl_gene_id, mart = mart)
    write.table(x = gene_TSS, file = ${_output:r}, sep = '\t', quote = FALSE, col.names = TRUE, row.names = FALSE)
    # gene_TSS_retrieval <- read.table(file = "gene_TSS.csv", sep = '\t', quote = "", stringsAsFactors = FALSE, header = TRUE)

Data extraction

In [3]:
[extract]
depends: executable('tabix'), R_library("data.table"), R_library('rjson'), R_library('vcfR'), R_library('dplyr')
parameter: radius = 1000000
parameter: label = 'Multi_Tissues'
genes = open(gene_id_file).read().splitlines()
input: for_each = 'genes'
output: f"{analysis_ready_dir:a}/{_genes}.{label}.rds"
# each job uses 10 nodes, each node 4 cores in parallel each core using 2G memory; and jobs are created in batches of 40.
task: trunk_workers = [4] * 10, trunk_size = 40, walltime = '20m', mem = '2G', cores = 1, tags = f'{step_name}_{_output:bn}'

bash: expand = '${ }', workdir = cwd
    # X
    tabix -h -f ${genotype_data} `awk '$4 ~ /${_genes}/ {print $0}' ${gene_tss_file:a} | head -n 1 | awk '{$3=$2+${radius}} {$2=$2-${radius}} {print "chr"$1":"$2"-"$3}'` > ${_genes}_genotype.vcf

R: expand = "${ }", workdir = cwd
    # Z
    Z <- lapply(c(${covariates_data:r,}), function(x) t(as.matrix(read.table(file = x, header = TRUE, sep = '\t', quote = "", row.names = 1))))
    for (i in 1:length(Z)) {
        # the format of sample names Z was changed somehow by `read.table`, from "GTEX-*" to "GTEX.*", so we need to convert it back.
        rownames(Z[[i]]) <- gsub(pattern = "GTEX.", replacement = "GTEX-", x = rownames(Z[[i]]))
        # add tissue names, in order match with those of y
        names(Z)[i] <- strsplit(x = c(${covariates_data:br,})[i], split = "[.]")[[1]][1]
    }
    
    # y
    filenames_json <- list.files(path = ${cwd:ar}, pattern = "*.json$", full.names = TRUE)
    line_numbers <- lapply(filenames_json, function(x) rjson::fromJSON(file=x))
  
    extract_y <- function(y_file, line) {
        if (is.null(line)) return(NULL)
        # a sed version here
        # cmd <- paste0("zcat ", y_file, " | sed '", line, "q;d'")
        # yi <- data.table::fread(cmd=cmd)
        yi <- data.table::fread(file = y_file, skip = line - 1, nrows = 1)
        samplenames_yi <- data.table::fread(file = y_file, skip = 0, nrows = 1)
        colnames(yi) <- colnames(samplenames_yi)
        yi <- t(as.matrix(yi))
        yi <- yi[-1:-4, , drop = FALSE]
        # y_file:bnn, add tissue names
        # colnames(yi) <- file_path_sans_ext(file_path_sans_ext(basename(y_file)))
        return(yi)
    }
    
    y <- lapply(line_numbers, function(x) extract_y(names(x), x[[1]][["${_genes}"]]))
    for (i in 1:length(y)) {
        # add tissue names, in order match with those of Z
        names(y)[i] <- strsplit(x = basename(names(line_numbers[[i]])), split = '[.]')[[1]][1]
    }
    y[sapply(y, is.null)] <- NULL
    
    ## tissue names matching between y and Z (y as reference)
    tissuenamesmatching <- function(x, reference) {
        x[match(names(reference), names(x))]
    }
    Z <- tissuenamesmatching(x = Z, reference = y)
    
    ## sample names matching between y and Z (Z as reference)
    samplenamesmatching <- function(x, reference) {
        lapply(1:length(reference), function(i) x[[i]][match(rownames(reference[[i]]), rownames(x[[i]])), , drop = FALSE])
    }
    y <- samplenamesmatching(x = y, reference = Z)
    for (i in 1:length(y)) {
        # the tissue names are lost during `samplenamesmatching`, it is therefore needed to retrieve the tissue names
        names(y)[i] <- names(Z)[i]
    }
    
    # X
    vcf <- vcfR::read.vcfR("${cwd:a}/${_genes}_genotype.vcf", verbose = FALSE)
    # FIXME: not sure about tri-allelic case?
    X <- vcfR::extract.gt(vcf, as.numeric = T)
    
    # y_res, residual of y
    y_res <- lapply(1:length(Z), function(i) .lm.fit(x = Z[[i]], y = y[[i]])$residuals)
    for (i in 1:length(y_res)) {
        # add tissue names of y_res
        names(y_res)[i] <- names(Z)[i]
    }

    # create a y_res matrix that matches X, as association analysis ready data
    # code below are written by Fabio Morgante
    
    options(stringsAsFactors=F)
    ###If the gene has expression values in multiple tissues
    if(length(y_res)>1){
      ###Loop through tissues and recursively join them
      for(i in 2:length(y_res)){
        if(i==2){
          df_a <- data.frame(id=rownames(y_res[[i-1]]), y_res[[i-1]])
          colnames(df_a)[2] <- names(y_res)[[i-1]]
          df_b <- data.frame(id=rownames(y_res[[i]]), y_res[[i]])
          colnames(df_b)[2] <- names(y_res)[[i]]
          Y_df <- dplyr::full_join(df_a, df_b, by="id")
        } else {
          df_b <- data.frame(id=rownames(y_res[[i]]), y_res[[i]])
          colnames(df_b)[2] <- names(y_res)[[i]]
          Y_df <- dplyr::full_join(Y_df, df_b, by="id")
        }
      }

      ###Assign row names as ID to Y_df and drop id column
      rownames(Y_df) <- Y_df[, 1]
      Y_df <- Y_df[, -1]

      ###If the tissue data contains the same individuals as the genotype data
      if(nrow(Y_df)==ncol(X)){
        ###Order the rows (ID) of the joined data according to the columns (ID) of the genotype matrix 
        X_names <- colnames(X)
        Y_mat <- as.matrix(Y_df[X_names, ])
      } else if(nrow(Y_df)<ncol(X)){ ###If the tissue data contains fewer individuals than the genotype data
        ###Compute the individuals in common between the tissue data and the genotype data
        X_names <- colnames(X)
        Y_names <- rownames(Y_df)
        in_common <- base::intersect(X_names, Y_names)

        ###Extract from the genotype data only the individuals in common between tissue data and the genotype data, and order tissue data according to the genotype data
        X <- X[, which(colnames(X) %in% in_common)]
        Y_mat <- as.matrix(Y_df[colnames(X), ])
      } else {
        stop("Error: There is a problem with IDs")
      } 
    } else { ###If the gene has expression values in only one tissue
      ###Extract y_res
      Y_df <- data.frame(y_res[[1]])

      ###If the tissue data contains the same individuals as the genotype data
      if(nrow(Y_df)==ncol(X)){
        ###Order the rows (ID) of the joined data according to the columns (ID) of the genotype matrix 
        X_names <- colnames(X)
        Y_mat <- as.matrix(Y_df[X_names, ])
        rownames(Y_mat) <- X_names
        colnames(Y_mat)[1] <- names(y_res)[1]
      } else if(nrow(Y_df)<ncol(X)){ ###If the tissue data contains fewer individuals than the genotype data
        ###Compute the individuals in common between the tissue data and the genotype data
        X_names <- colnames(X)
        Y_names <- rownames(Y_df)
        in_common <- base::intersect(X_names, Y_names)

        ###Extract from the genotype data only the individuals in common between tissue data and the genotype data, and order tissue data according to the genotype data
        X <- X[, which(colnames(X) %in% in_common)]
        Y_mat <- as.matrix(Y_df[colnames(X), ])
        rownames(Y_mat) <- colnames(X)
        colnames(Y_mat)[1] <- names(y_res)[1]
      } else {
          stop("Error: There is a problem with IDs")
      } 
    }
    # save
    saveRDS(object = list(X = t(X), y = y, Z = Z, y_res = Y_mat), file = ${_output:r})

bash: expand = '${ }', workdir = cwd
    # remove intermediate files
    rm -f ${_genes}_genotype.vcf

© 2017-2018 authored by Gao Wang at Stephens Lab, The University of Chicago

Exported from manuscript_results/GTEx_Association_Preprocessing.ipynb committed by Gao Wang on Sun May 24 22:12:23 2020 revision 2, 2f1a5c8