This pipeline aims to extract individual-level genotype-expression data in preparation for cis-eQTL fine-mapping. It was written by Jiarun Chen (Tsinghua U.) advised by Gao Wang.
We want to organize data in units of genes. For each gene, we would like to have an RDS file that saves the genotype (sample by variants), expression (sample by condition, ie, tissues in GTEx context), and covariates data (sample by covariates). Additionally to facilicate downstream analysis, for each condition of normalized gene expression we regress out covariates and save residuals as a residual expression matrix.
Each RDS file contains information to perform association analysis of one gene,
Z
): This is independent of genes, so this is the same among genes and can be directly copied from the original covariates data, for each condition (tissues in GTEx).y
, and residual, y_res
): Each condition saves expression data for all genes in one file. We first generate a gene:line_number
meta-database -- for each gene, the line_number
is the line numbers of the gene in a expression data files. Once we have the meta-data, for each gene we read the corresponding lines from each condition and combine them as the expression matrix. This will be a lot faster than, e.g., to use grep
for genes from each file; and is easier to write than say using database type of languages. To implement,ensembl_gene_id
in the 49 expression files, using pandas
.gene:line_number
for each gene in each tissue based on the extracted column and their row index, and write them to text files as meta information.X
): Given TSS data of all genes, we extract from VCF file genotype within a predefined radius around the TSS (say, 1Mb).Genotype data is one VCF file.
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT GTEX-1117F GTEX-111CU GTEX-111FC GTEX-111VG GTEX-111YS GTEX-1122O
chr1 13526 chr1_13526_C_T_b38 C T . PASS AN=1676;AF=0.000596659;AC=1 GT 0|0 0|0 0|0 0|0 0|0 0|0
Expression data is per file per condition,
#chr start end gene_id GTEX-1117F GTEX-111CU GTEX-111FC GTEX-111VG GTEX-111YS GTEX-1122O GTEX-1128S
chr1 29552 29553 ENSG00000227232.5 1.3135329173730264 -0.9007944960568992 -0.29268956046586164 -0.7324113622418431 -0.27475411245874887 -0.6990255198601908 0.18188866123299216
Covariate data is per file per condition,
ID GTEX-1117F GTEX-111CU GTEX-111FC GTEX-111VG GTEX-111YS
PC1 -0.0867 0.0107 0.0099 0.0144 0.0154
PC2 -0.0132 -0.0026 -0.0050 -0.0081 -0.0093
Every gene is an RDS file containing information outlined in previous section. Below is a processed example:
dat <- readRDS(file = '/scratch/midway2/chj1ar/GTEx_Analysis_v8_eQTL_expression_genewise/output/ENSG00000284484.1.Multi_Tissues.RDS')
str(dat)
sos run 20191029_GTEx_V8_preprocessing.ipynb -h
You can edit and change the following bash variable. First edit and run the following bash variable.
work_dir=~/GTEx_Analysis_v8_eQTL_expression_genewise
Then run as follows:
cd $work_dir
sos run 20191029_GTEx_V8_preprocessing.ipynb preprocess
sos run 20191029_GTEx_V8_preprocessing.ipynb extract --analysis-ready-dir /project2/compbio/GTEx_eQTL/cis_eqtl_analysis_ready
sos run 20191029_GTEx_V8_preprocessing.ipynb dap --analysis-ready-dir /project2/compbio/GTEx_eQTL/cis_eqtl_analysis_ready
To run the code on UChicago RCC midway, for example,
sos run 20191029_GTEx_V8_preprocessing.ipynb extract --gene-id-file output/ensembl_gene_id.txt \
--analysis-ready-dir /project2/compbio/GTEx_eQTL/cis_eqtl_analysis_ready \
-c midway2.yml -q midway2
If you want to only extract a self-defined list, eg, first 2 genes from the gene_id_file
,
work_dir=~/GTEx_Analysis_v8_eQTL_expression_genewise
cd $work_dir
sos run 20191029_GTEx_V8_preprocessing.ipynb preprocess
head -2 output/ensembl_gene_id.txt > output/test.txt
sos run 20191029_GTEx_V8_preprocessing.ipynb extract --gene-id-file output/test.txt --analysis-ready-dir output/
[global]
import os
from glob import glob
parameter: cwd = path('./output')
parameter: expression_data = paths(glob('/project2/compbio/GTEx_dbGaP/GTEx_Analysis_2017-06-05_v8/eqtl/GTEx_Analysis_v8_eQTL_expression_matrices/*.gz'))
parameter: genotype_data = path('/project2/compbio/GTEx_dbGaP/GTEx_Analysis_2017-06-05_v8/genotypes/WGS/variant_calls/GTEx_Analysis_2017-06-05_v8_WholeGenomeSeq_838Indiv_Analysis_Freeze.SHAPEIT2_phased.vcf.gz')
parameter: covariates_data = paths(glob('/project2/compbio/GTEx_dbGaP/GTEx_Analysis_2017-06-05_v8/eqtl/GTEx_Analysis_v8_eQTL_covariates/*.txt'))
parameter: gene_id_file = path(f"{cwd:a}/ensembl_gene_id.txt")
parameter: gene_tss_file = path(f"{cwd:a}/ensembl_gene_tss.txt")
parameter: analysis_ready_dir = cwd
# get gene-line mapping
[get_gene_meta_1]
input: expression_data, group_by = 1
output: f"{cwd:a}/{_input:bnn}.gene_meta.json", f"{cwd:a}/{_input:bnn}.gene_meta.gz"
python3: expand = '${ }'
import pandas as pd
genes = pd.read_csv(${_input:r}, sep='\t', compression='gzip', header=0, skip_blank_lines=False, usecols = [0,1,2,3])
gene_linenum = dict([(x,y+1) for y, x in enumerate(genes['gene_id'].tolist())])
import json
with open(${_output[0]:r}, 'w') as outfile:
json.dump({${_input:r}:gene_linenum}, outfile)
# chr, start, end, gene name
genes.to_csv(${_output[1]:r}, sep = '\t', header=False, index=False)
# TSS for genes using provided TSS information
[get_gene_meta_2]
input: group_by = 'all'
output: gene_tss_file
bash: expand = '${ }', workdir = cwd
zcat ${paths(_input[1::2])} | sort -u > ${_output}
# get all gene names
[get_gene_meta_3]
output: gene_id_file
bash: expand = '${ }', workdir = cwd
awk '{print $4}' ${_input} | sort -u > ${_output}
[preprocess]
sos_run('get_gene_meta')
%cd ~/GTEx_Analysis_v8_eQTL_expression_genewise/output
wc -l ensembl_gene_tss.txt
wc -l ensembl_gene_id.txt
The following workflow is not used for the current project because biomaRt fails to find some ensembl_gene_id
. The loss is considerable, approximately from 40K to 20K. This is mostly due to gene name version inconsistencies in GTEx and in current Ensembl. For example the gene ENSG00000284523.1 is now ENSG00000284523.2 in Ensembl; ENSG00000284552.1 can no longer be found.
# # TSS for genes by biomaRt
[get_gene_meta_biomaRt]
depends: R_library("biomaRt"), R_library("data.table")
input: gene_id_file
output: gene_tss_file
R: expand = '${ }'
ensembl_gene_id <- data.table::fread(file = ${_input:r}, sep = "\n", quote = "", header = FALSE)
mart <- biomaRt::useDataset("hsapiens_gene_ensembl", biomaRt::useMart("ensembl"))
gene_TSS <- biomaRt::getBM(attributes = c("chromosome_name", "transcript_start", "transcript_end", "ensembl_gene_id", "ensembl_gene_id_version"), filters = "ensembl_gene_id_version", values = ensembl_gene_id, mart = mart)
write.table(x = gene_TSS, file = ${_output:r}, sep = '\t', quote = FALSE, col.names = TRUE, row.names = FALSE)
# gene_TSS_retrieval <- read.table(file = "gene_TSS.csv", sep = '\t', quote = "", stringsAsFactors = FALSE, header = TRUE)
[extract]
depends: executable('tabix'), R_library("data.table"), R_library('rjson'), R_library('vcfR'), R_library('dplyr')
parameter: radius = 1000000
parameter: label = 'Multi_Tissues'
genes = open(gene_id_file).read().splitlines()
input: for_each = 'genes'
output: f"{analysis_ready_dir:a}/{_genes}.{label}.rds"
# each job uses 10 nodes, each node 4 cores in parallel each core using 2G memory; and jobs are created in batches of 40.
task: trunk_workers = [4] * 10, trunk_size = 40, walltime = '20m', mem = '2G', cores = 1, tags = f'{step_name}_{_output:bn}'
bash: expand = '${ }', workdir = cwd
# X
tabix -h -f ${genotype_data} `awk '$4 ~ /${_genes}/ {print $0}' ${gene_tss_file:a} | head -n 1 | awk '{$3=$2+${radius}} {$2=$2-${radius}} {print "chr"$1":"$2"-"$3}'` > ${_genes}_genotype.vcf
R: expand = "${ }", workdir = cwd
# Z
Z <- lapply(c(${covariates_data:r,}), function(x) t(as.matrix(read.table(file = x, header = TRUE, sep = '\t', quote = "", row.names = 1))))
for (i in 1:length(Z)) {
# the format of sample names Z was changed somehow by `read.table`, from "GTEX-*" to "GTEX.*", so we need to convert it back.
rownames(Z[[i]]) <- gsub(pattern = "GTEX.", replacement = "GTEX-", x = rownames(Z[[i]]))
# add tissue names, in order match with those of y
names(Z)[i] <- strsplit(x = c(${covariates_data:br,})[i], split = "[.]")[[1]][1]
}
# y
filenames_json <- list.files(path = ${cwd:ar}, pattern = "*.json$", full.names = TRUE)
line_numbers <- lapply(filenames_json, function(x) rjson::fromJSON(file=x))
extract_y <- function(y_file, line) {
if (is.null(line)) return(NULL)
# a sed version here
# cmd <- paste0("zcat ", y_file, " | sed '", line, "q;d'")
# yi <- data.table::fread(cmd=cmd)
yi <- data.table::fread(file = y_file, skip = line - 1, nrows = 1)
samplenames_yi <- data.table::fread(file = y_file, skip = 0, nrows = 1)
colnames(yi) <- colnames(samplenames_yi)
yi <- t(as.matrix(yi))
yi <- yi[-1:-4, , drop = FALSE]
# y_file:bnn, add tissue names
# colnames(yi) <- file_path_sans_ext(file_path_sans_ext(basename(y_file)))
return(yi)
}
y <- lapply(line_numbers, function(x) extract_y(names(x), x[[1]][["${_genes}"]]))
for (i in 1:length(y)) {
# add tissue names, in order match with those of Z
names(y)[i] <- strsplit(x = basename(names(line_numbers[[i]])), split = '[.]')[[1]][1]
}
y[sapply(y, is.null)] <- NULL
## tissue names matching between y and Z (y as reference)
tissuenamesmatching <- function(x, reference) {
x[match(names(reference), names(x))]
}
Z <- tissuenamesmatching(x = Z, reference = y)
## sample names matching between y and Z (Z as reference)
samplenamesmatching <- function(x, reference) {
lapply(1:length(reference), function(i) x[[i]][match(rownames(reference[[i]]), rownames(x[[i]])), , drop = FALSE])
}
y <- samplenamesmatching(x = y, reference = Z)
for (i in 1:length(y)) {
# the tissue names are lost during `samplenamesmatching`, it is therefore needed to retrieve the tissue names
names(y)[i] <- names(Z)[i]
}
# X
vcf <- vcfR::read.vcfR("${cwd:a}/${_genes}_genotype.vcf", verbose = FALSE)
# FIXME: not sure about tri-allelic case?
X <- vcfR::extract.gt(vcf, as.numeric = T)
# y_res, residual of y
y_res <- lapply(1:length(Z), function(i) .lm.fit(x = Z[[i]], y = y[[i]])$residuals)
for (i in 1:length(y_res)) {
# add tissue names of y_res
names(y_res)[i] <- names(Z)[i]
}
# create a y_res matrix that matches X, as association analysis ready data
# code below are written by Fabio Morgante
options(stringsAsFactors=F)
###If the gene has expression values in multiple tissues
if(length(y_res)>1){
###Loop through tissues and recursively join them
for(i in 2:length(y_res)){
if(i==2){
df_a <- data.frame(id=rownames(y_res[[i-1]]), y_res[[i-1]])
colnames(df_a)[2] <- names(y_res)[[i-1]]
df_b <- data.frame(id=rownames(y_res[[i]]), y_res[[i]])
colnames(df_b)[2] <- names(y_res)[[i]]
Y_df <- dplyr::full_join(df_a, df_b, by="id")
} else {
df_b <- data.frame(id=rownames(y_res[[i]]), y_res[[i]])
colnames(df_b)[2] <- names(y_res)[[i]]
Y_df <- dplyr::full_join(Y_df, df_b, by="id")
}
}
###Assign row names as ID to Y_df and drop id column
rownames(Y_df) <- Y_df[, 1]
Y_df <- Y_df[, -1]
###If the tissue data contains the same individuals as the genotype data
if(nrow(Y_df)==ncol(X)){
###Order the rows (ID) of the joined data according to the columns (ID) of the genotype matrix
X_names <- colnames(X)
Y_mat <- as.matrix(Y_df[X_names, ])
} else if(nrow(Y_df)<ncol(X)){ ###If the tissue data contains fewer individuals than the genotype data
###Compute the individuals in common between the tissue data and the genotype data
X_names <- colnames(X)
Y_names <- rownames(Y_df)
in_common <- base::intersect(X_names, Y_names)
###Extract from the genotype data only the individuals in common between tissue data and the genotype data, and order tissue data according to the genotype data
X <- X[, which(colnames(X) %in% in_common)]
Y_mat <- as.matrix(Y_df[colnames(X), ])
} else {
stop("Error: There is a problem with IDs")
}
} else { ###If the gene has expression values in only one tissue
###Extract y_res
Y_df <- data.frame(y_res[[1]])
###If the tissue data contains the same individuals as the genotype data
if(nrow(Y_df)==ncol(X)){
###Order the rows (ID) of the joined data according to the columns (ID) of the genotype matrix
X_names <- colnames(X)
Y_mat <- as.matrix(Y_df[X_names, ])
rownames(Y_mat) <- X_names
colnames(Y_mat)[1] <- names(y_res)[1]
} else if(nrow(Y_df)<ncol(X)){ ###If the tissue data contains fewer individuals than the genotype data
###Compute the individuals in common between the tissue data and the genotype data
X_names <- colnames(X)
Y_names <- rownames(Y_df)
in_common <- base::intersect(X_names, Y_names)
###Extract from the genotype data only the individuals in common between tissue data and the genotype data, and order tissue data according to the genotype data
X <- X[, which(colnames(X) %in% in_common)]
Y_mat <- as.matrix(Y_df[colnames(X), ])
rownames(Y_mat) <- colnames(X)
colnames(Y_mat)[1] <- names(y_res)[1]
} else {
stop("Error: There is a problem with IDs")
}
}
# save
saveRDS(object = list(X = t(X), y = y, Z = Z, y_res = Y_mat), file = ${_output:r})
bash: expand = '${ }', workdir = cwd
# remove intermediate files
rm -f ${_genes}_genotype.vcf
Exported from manuscript_results/GTEx_Association_Preprocessing.ipynb
committed by Gao Wang on Sun May 24 22:12:23 2020 revision 2, 2f1a5c8