Simulation

Last updated: 2019-11-06

Checks: 6 1

Knit directory: causal-TWAS/

This reproducible R Markdown analysis was created with workflowr (version 1.4.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: uncommitted changes

The R Markdown file has unstaged changes. To know which version of the R Markdown file created these results, you’ll want to first commit it to the Git repo. If you’re still working on the analysis, you can ignore this warning. When you’re finished, you can run wflow_publish to commit the R Markdown file and build the HTML.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

Seed: set.seed(20191103)

The command set.seed(20191103) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: f7be05a

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility. The version displayed above was the version of the Git repository at the time these results were generated.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rproj.user/
    Ignored:    code/.R
    Ignored:    code/.ipynb_checkpoints/
    Ignored:    data/

Untracked files:
    Untracked:  code/gemma_GRM.R
    Untracked:  code/impute_expression.R

Unstaged changes:
    Modified:   analysis/model.Rmd
    Modified:   analysis/simulation-WTCCC.Rmd
    Modified:   analysis/simulation-singleG.Rmd
    Deleted:    code/GEMMA_GRM.R
    Deleted:    code/simulate_WTCCC.R

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the R Markdown and HTML files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view them.

File	Version	Author	Date	Message
html	56ba4ae	simingz	2019-11-03	add git icon
Rmd	68e228f	simingz	2019-11-03	model and Xin’s simulation
html	68e228f	simingz	2019-11-03	model and Xin’s simulation

We performed simulations todemonstrate the importance of accounting for pleiotropic effects ofSNPs when performing TWAS. We take real genotype data of aregion near GNL3 (a top gene in a TWAS of SCZ) from WTCCC(n= 3,000). We first simulate expression data of the gene inthe region, assuming two independent cis-eQTLs with total cis-heritability of 30%. We next simulate phenotype data under twosettings: (1) the gene has no causal effect, but one of the eQTLsacts on the phenotype (through other mechanism); (2) the genehas a causal effect, and no other SNPs in the region affect the phenotype. We compare TWAS (using Lassoas expression models) with a simple “causal TWAS” method, which performs multiple regression of phenotypeagainst all SNPs in the region plus the predicted gene expression. We use SuSiE, a Bayesian variable selectionmethod for regression. For TWAS, we use a stringentp-value threshold of 0.001 (note that we are testing a singlegene). Since SuSie does not report p-values, we call the gene effect significant if it is found in the “credible set”.In the first setting (gene has no causal effect), we found the false positive (FP) rates of TWAS are muchhigher than the causal TWAS method (Figure 3A). This is because under this model, the association of geneexpression with trait is explained away by the causal SNP. In the second setting (gene has causal effect), whileTWAS has high power as expected, the causal TWAS method also achieves appreciable power (Figure 3B).These simulations thus demonstrate the feasibility of controlling FP rates while maintaining power. We believethe full EB framework outlined below will help close the gap of power. Additionally, in our simulations, we benefitfrom the fact that the gene has two independent eQTL signals. In reality, EB approach would be critical for thesituation where a gene has single eQTL signal.

Gene has no causal effect. The effectsize of the pleiotropic SNP is measured by the Per-cent of Variance Explained (PVE) for the phenotype.(B) Gene has a causal effect, with the effect size (mea-sured by PVE) shown at the x-axis. Note that we uselarge effect size (PVE) in simulations to reduce samplesizes needed.

library(MASS)
library(susieR)
library(glmnet)

Loading required package: Matrix

Loading required package: foreach

Loaded glmnet 2.0-16

# read genotype data
file <- "data/genotype_GNL3.Rdata"
load(file)
G <- scale(mydata)
N <- nrow(G)
M <- ncol(G)

# Simulation function
# G: genotype matrix. h2.eQTL: h2 of eQTL. n.eQTL: number of eQTL SNPs. gamma: effect size of gene. theta: effect size of SNP.
simulate <- function(G, h2.eQTL, n.eQTL, gamma, theta) {
  N <- dim(G)[1]
  M <- dim(G)[2]
  
  idx.eQTL <- sample(1:M, n.eQTL)
  alpha <- rep(sqrt(h2.eQTL/n.eQTL), n.eQTL)
  
  # simulate gene expression data
  X <- G[, idx.eQTL] %*% alpha + rnorm(N, sd=1)
  
  # simulate phenotype data
  idx.SNP <- idx.eQTL[1] # choose a SNP that is one eQTL
  Y <- X * gamma + as.matrix(G[, idx.SNP]) %*% theta + rnorm(N)
  
  # return results
  return (list(X=X, Y=Y, idx.eQTL=idx.eQTL, idx.SNP))
}

# Evaluation function
evaluate <- function(G, X, Y) {
  # predicted expression
  cvfit <- cv.glmnet(G, X)
  X.tilde <- predict(cvfit, G, s = "lambda.min")
  
  # TWAS
  fit <- lm(Y ~ X.tilde)
  pvalue <- summary(fit)$coefficients["X.tilde",4]
  
  # causal-TWAS
  res <- susie(cbind(G, X.tilde), Y, L=1)
  gene.indicator <- (M+1) %in% res$sets$cs$L1
  
  # return results
  return (list(twas.pvalue=pvalue, causal.twas.gene=gene.indicator))
}

# Compare TWAS vs. causal-TWAS
h2.eQTL <- 0.3
nsims <- 50

# false positive simulation
gamma <- 0.0
theta <- 0.07
n.eQTL <- 2

twas.pvalue <- array(nsims)
causal.twas.gene <- array(nsims)
for (i in 1:nsims) {
  data <- simulate(G, h2.eQTL, n.eQTL, gamma, theta)
  result <- evaluate(G, data$X, data$Y)
  twas.pvalue[i] <- result$twas.pvalue
  causal.twas.gene[i] <- result$causal.twas.gene
}
cat("FP rate of TWAS = ", sum(twas.pvalue < 0.001)/nsims, "\n")

FP rate of TWAS =  0.36

cat("FP rate of causal TWAS = ", sum(causal.twas.gene==TRUE)/nsims, "\n")

FP rate of causal TWAS =  0.18

FP.low <- c(0.38, 0.04)
FP.high <- c(0.62, 0.14)
FP.rates <- cbind(FP.low, FP.high)
colors <- c("orange","blue")
barplot(FP.rates, col=colors, xlab="PVE of pleiotropic SNP", ylab="False positive rate", ylim=c(0,0.8), names.arg=c("0.5%", "1%"), beside=TRUE, cex.lab=1.3)
legend("topleft",c("TWAS","causal TWAS"),col=colors,pch=15, cex=1.2)

Version	Author	Date
68e228f	simingz	2019-11-03

# power simulation
n.eQTL <- 2
gamma <- 0.18
theta <- 0

twas.pvalue <- array(nsims)
causal.twas.gene <- array(nsims)
for (i in 1:nsims) {
  data <- simulate(G, h2.eQTL, n.eQTL, gamma, theta)
  result <- evaluate(G, data$X, data$Y)
  twas.pvalue[i] <- result$twas.pvalue
  causal.twas.gene[i] <- result$causal.twas.gene
}
cat("Power of TWAS = ", sum(twas.pvalue < 0.001)/nsims, "\n")

Power of TWAS =  0.94

cat("Power of causal TWAS = ", sum(causal.twas.gene==TRUE)/nsims, "\n")

Power of causal TWAS =  0.78

# plot the results
power.low <- c(0.7, 0.4)
power.high <- c(0.96, 0.68)
TP.rates <- cbind(power.low, power.high)
colors <- c("orange","blue")
barplot(TP.rates, col=colors, xlab="PVE of gene", ylab="Power", ylim=c(0,1), names.arg=c("0.5%", "1%"), beside=TRUE, cex.lab=1.3)
legend("topleft",c("TWAS","causal TWAS"),col=colors,pch=15, cex=1.2)

Version	Author	Date
68e228f	simingz	2019-11-03

sessionInfo()

R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Scientific Linux 7.4 (Nitrogen)

Matrix products: default
BLAS/LAPACK: /software/openblas-0.2.19-el7-x86_64/lib/libopenblas_haswellp-r0.2.19.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] glmnet_2.0-16 foreach_1.4.4 Matrix_1.2-15 susieR_0.8.0  MASS_7.3-51.1

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.0         knitr_1.20         whisker_0.3-2     
 [4] magrittr_1.5       workflowr_1.4.0    lattice_0.20-38   
 [7] stringr_1.4.0      tools_3.5.1        grid_3.5.1        
[10] git2r_0.25.2       matrixStats_0.54.0 htmltools_0.3.6   
[13] iterators_1.0.10   yaml_2.2.0         rprojroot_1.3-2   
[16] digest_0.6.18      fs_1.3.1           codetools_0.2-15  
[19] glue_1.3.0         evaluate_0.12      rmarkdown_1.10    
[22] stringi_1.3.1      compiler_3.5.1     backports_1.1.2

Simulation – Single Gene

Siming Zhao

11/3/2019

Simulation