Last updated: 2022-01-24

Checks: 2 0

Knit directory: rss/

This reproducible R Markdown analysis was created with workflowr (version 1.7.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 0e29090. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the repository in which changes were made to the R Markdown (rmd/faq.Rmd) and HTML (docs/faq.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File Version Author Date Message
Rmd 0e29090 Xiang Zhu 2022-01-24 wflow_publish("rmd/faq.Rmd")
html f405f7d Xiang Zhu 2021-03-05 Build site.
Rmd 7a30989 Xiang Zhu 2021-03-05 wflow_publish("rmd/faq.Rmd")
html e66733f Xiang Zhu 2020-06-24 Build site.
Rmd 3a6660a Xiang Zhu 2020-06-24 wflow_publish("rmd/faq.Rmd")
html bab3f58 Xiang Zhu 2020-06-24 Build site.
html 68f2b9b Xiang Zhu 2020-06-23 Build site.
Rmd d40657c Xiang Zhu 2020-06-23 wflow_publish("rmd/faq.Rmd")

Please feel free to create a new issue and/or send an email to xiangzhu[at]psu.edu.

Statistics

Q: The following distribution is frequently used in RSS methods: \(\beta_j \sim \pi_j\cdot{\cal N}(0,\sigma_j^2) + (1-\pi_j)\cdot\delta_0\). How do you simulate samples from this distribution?

A: This is a mixture of a normal distribution \({\cal N}(0,\sigma_j^2)\) and a point mass at zero \(\delta_0\). To simulate numbers from this mixture distribution, we can first simulate a latent variable \(\gamma_j\) from a Bernoulli distribution which takes the value 1 with probability \(\pi_j\). If \(\gamma_j=1\), we draw the value of \(\beta_j\) from the normal distribution \({\cal N}(0,\sigma_j^2)\). If \(\gamma_j=0\), we set \(\beta_j=0\).

Q: The RSS-BVSR model contains two parameters \(\{\pi,\sigma_B^2\}\): \(\beta_j \sim \pi\cdot{\cal N}(0,\sigma_B^2) + (1-\pi)\cdot\delta_0\). How do you use the output of rss_bvsr.m to find the posterior means of \(\{\pi,\sigma_B^2\}\)?

A: To find posterior means of \(\{\pi,\sigma_B^2\}\), we need to obtain their MCMC samples from the output of rss_bvsr.m first. The MCMC samples of \(\pi\) is simply exp(logpisam), where logpisam is the MCMC samples of \(\log(\pi)\) from the output of rss_bvsr.m. The MCMC samples of \(\sigma_B^2\) is obtained as follows: (1) Get the MCMC samples of \(\{h,\pi\}\) from the output of rss_bvsr.m as hsam and exp(logpisam); (2) Use hsam and exp(logpisam) to obtain the MCMC samples of \(\sigma_B^2\), based on the first half of Equation 3.4 in Zhu and Stephens (2017) (NB: \(\rho\equiv 1\) for RSS-BVSR). Once we have the MCMC samples of \(\{\pi,\sigma_B^2\}\), we can compute their sample means to estimate posterior means of \(\{\pi,\sigma_B^2\}\).

Q: How could I reduce the computational time of Markov chain Monte Carlo (MCMC) programs provided in Zhu and Stephens (2017)?

A: You can reduce the time by shortening the length of MCMC chains, that is, setting a smaller value for Ndraw in the MCMC programs (rss/src). However, please note that using short MCMC chains may yield inaccurate inference results for large-scale problems.

Q: Zhu and Stephens (2017) reported a leave-one-out residual imputation method for model checking. Do you have software for this method?

A: Yes. I write a subroutine mvnloo.m for this check. This leave-one-out residual imputation is a simple application of multivariate normal distribution. For more details, please see Equations D.1 and D.2 on page 10 of Supplement to Zhu and Stephens (2017).

Q: How could I reduce the computational time of variational Bayes (VB) programs provided in Zhu and Stephens (2018)?

A: There are two options to reduce the time of running the VB programs (rss/src_vb). First, you can set a time limit for the VB programs, that is, setting your own options.max_walltime value. Second, you can use a more relaxed convergence tolerance, that is, setting your own options.tolerance value. Similar to MCMC, please note that stopping VB iterations early may yield inaccurate inference results for large-scale problems.

Q: Zhu and Stephens (2018) described a simple likelihood ratio calculation as a sanity check for the more sophisticated enrichment analysis based on RSS. Do you have software for this sanity check?

A: Yes. I write a stand-alone script ash_lrt_31traits.R for this sanity check. Please carefully read the instruction in this script. For more details of this sanity check, please see the caption of Supplementary Figure 17 in Zhu and Stephens (2018).

Q: Zhu and Stephens (2018) described an integrated framework based on RSS with both enrichment and prioritization components, and we are only interested in the gene prioritization component. How do we perform the gene-level association test using RSS (something like Figure 3 in the paper)?

A: Example 5 Part A gives a step-by-step illustration of gene-level association methods built on RSS, with input and output files included. Since this example is based on a relatively small simulation, it is highly recommended to run the codes and see how the methods work.

Q: Can I use the prioritization component of Zhu and Stephens (2018) as a generic gene-level association testing method, assuming that I do not have any gene set information available?

A: Yes. When gene set information is not available, one can only fit a baseline model (\(M_0:\theta=0\)) and obtain corresponding gene-level results. Indeed Figure 3 Panel A of Zhu and Stephens (2018) illustrates a possible use scenario like this.

Genetics

Q: RSS uses the shrinkage estimator of LD in Wen and Stephens (2010), the computation of which requires the “scaled population recombination rate” (\(\rho_{ij}\)). How do I compute \(\rho_{ij}\)?

A: Details of calculating \(\rho_{ij}\) are provided in these two great papers: Li and Stephens (2003) and Wen and Stephens (2010). I also write a short tutorial to illustrate the calculation of scaled population recombination rate (\(\rho_{ij}\)) using HapMap genetic map.

Q: Most published RSS analyses used phased haplotypes to estimate LD. Can I use unphased genotype data to compute this shrinkage estimator of LD?

A: Yes. You can feed get_corr.m an unphased genotype matrix and specify that isgeno=true. There is a “0.5” correction term for genotype data, which comes from a random mating assumption. Please see Section 2.4 of Wen and Stephens (2010) for more details.

Q: I know of Wen and Stephens (2010), but I’m not interested in imputation. Do you have software that just computes the banded covariance or LD matrix?

A: Yes. If you have MATLAB available, you can use get_corr.m. If you prefer R, you can use get_corr.R. I am also working on an R package ldshrink that can produce the banded LD matrix in my (very limited) spare time.

Q: In classical enrichment analyses, we often filter SNPs based on their GWAS p-values (e.g., only using SNPs with p-values smaller than \(5\times 10^{-8}\) to test enrichments). Can I apply the same SNP filter before running RSS methods for enrichment analyses?

A: No. The validity of enrichment analysis in RSS relies on an accurately inferred ‘baseline’ model (\(M_0\)). Specifically, we need to learn from the GWAS data about the average proportion of SNPs associated with a trait without any enrichment information. To get a reasonable estimate we need to consider genome-wide SNPs, including those with large p-values. Here is a toy example to understand what might go wrong if we only considered SNPs with small p-values. Suppose there are 1M independent SNPs genome wide and only 100 of them are truly trait-associated. The true proportion of trait-associated SNPs is 1e-4=100/1M. Suppose we identify 40 of 100 trait-associated SNPs with GWAS p-value <= 5e-8. If we only look at the filtered 1K SNPs with GWAS p-value <= 5e-8, then we get an overestimate 4e-2=40/1K.

Data

Q: If I want to use RSS methods, how should I prepare for the input data?

A: Thanks for your interest in trying RSS methods! To help you prepare your own data for RSS-based analyses, I write a short tutorial about input data formats. Please note that you need to store your input data in MAT-files if you plan to use the MATLAB implementation of RSS methods.

Q: Why the SiRiS matrix is coded as repmat((1./se),1,p) .* R .* repmat((1./se)',p,1), but not diag(1./se) .* R .* diag(1./se)?

A: The SiRiS matrix in our codes corresponds to \({\bf S}^{-1}{\bf R}{\bf S}^{-1}\) in our math notations. We code this matrix as repmat((1./se),1,p) .* R .* repmat((1./se)',p,1) instead of the incorrect one diag(1./se) .* R .* diag(1./se) because .* performs element-by-element multiplication in MATLAB.

Q: Zhu and Stephens (2017) reported a whole genome analysis of human adult height data based on RSS. I am wondering whether you can share this preprocessed height dataset.

A: Yes. This preprocessed dataset is publicly available at https://doi.org/10.5281/zenodo.1443565. and can be referenced in a journal’s “Data availability” section as DOI. For more information on this dataset, please see this page.

Q: Zhu and Stephens (2018) reported large-scale enrichment analyses of 4,026 gene sets across 31 complex human traits. Can you share these gene sets?

A: Yes. All 4,026 pre-processed gene sets used in this study (including 3,913 biological pathways and 113 tissue-based gene sets) are freely available at xiangzhu/rss-gsea. These gene sets can be referenced in a journal’s “Data availability” section as DOI. For more information on these gene sets, please see this page.

Software

Q: RSS is cool, but it is implemented in MATLAB, which is a proprietary programming language. Can RSS methods be made to work with some open source language?

A: To further improve the accessibility of RSS methods, I am currently working on an R package rssr in my (very limited) spare time. Comments and suggestions are welcome!