A Dynamic Statistical Comparison benchmark

Please read our manuscript for simulation details.

Because the simulation is based on human genotype data from the GTEx project, which is available only to approved users via dbGaP, we cannot share the input genotype data publicly. However, in this repository we share:

  1. The complete code-set for our fine-mapping benchmark
  2. The commands to run the benchmark
  3. An example run on the genotype data from the DAP-G paper (bioRxiv preprint, May 2018)
  4. All output files from the benchmark including intermediate files and figures

We implemented the numerical study of the manuscript in the DSC framework. The benchmark is a single SoS Notebook file (a Jupyter notebook with the SoS kernel, in JSON format), finemapping.ipynb, that consolidates all scripts required for the benchmark. It is distributed in this repository under the src folder.

To run the benchmark, first navigate to src, then export scripts from the notebook:

./export.sos

You should see a number of scripts generated under src and src/modules with extensions *.dsc, *.py and *.R. These scripts are required to run the benchmark.
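To check that the export step produced what is described above, the generated files can be listed by extension. The following is a minimal sketch, not part of the repository: the helper name list_exported_scripts is hypothetical, and it assumes only the folder layout (src and src/modules) and the three extensions named in the text.

```python
from pathlib import Path

# Extensions named in the text above; anything else is ignored.
EXPECTED_SUFFIXES = {".dsc", ".py", ".R"}


def list_exported_scripts(src_dir="src"):
    """Group exported scripts under src_dir and src_dir/modules by extension.

    Hypothetical helper for illustration; it only inspects file names.
    """
    src = Path(src_dir)
    found = {suffix: [] for suffix in sorted(EXPECTED_SUFFIXES)}
    for folder in (src, src / "modules"):
        if not folder.is_dir():
            continue  # folder missing, e.g. export has not been run yet
        for path in sorted(folder.iterdir()):
            if path.is_file() and path.suffix in EXPECTED_SUFFIXES:
                found[path.suffix].append(path)
    return found


if __name__ == "__main__":
    for suffix, paths in list_exported_scripts().items():
        print(f"{suffix}: {len(paths)} file(s)")
        for path in paths:
            print(f"  {path}")
```

If any of the three groups is empty after running ./export.sos, the export most likely did not complete.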

We provide a dockerfile to run the benchmark. If you prefer to set up the computing environment yourself, please consult this file to see how we configured it, including the versions of the software we used for the paper.

A toy benchmark example

Here, using one data-set from the DAP-G paper, we demonstrate how the benchmark works:

dsc susie.dsc --target run_comparison -o toy_comparison -c 8

screen output:

INFO: Load command line DSC sequence: run_comparison
INFO: DSC script exported to toy_comparison.html
INFO: Constructing DSC from susie.dsc ...
INFO: Building execution graph & running DSC ...
[#####################] 21 steps processed (183 jobs completed)
INFO: Building DSC database ...
INFO: DSC complete!
INFO: Elapsed time 6456.271 seconds.

Results from this one toy data-set can be found under the toy_comparison folder.
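The screen output above reports the run size and elapsed time in seconds. As a small sketch, the summary lines can be parsed and the time converted to hours and minutes. The function names are hypothetical, and the regular expressions assume the exact log wording shown in the example output; other DSC versions may print these lines differently.

```python
import re


def summarize_dsc_log(log_text):
    """Extract step count, job count and elapsed seconds from DSC screen output.

    Assumes the wording of the example log; returns None for missing fields.
    """
    steps = jobs = seconds = None
    match = re.search(r"(\d+) steps processed \((\d+) jobs completed\)", log_text)
    if match:
        steps, jobs = int(match.group(1)), int(match.group(2))
    match = re.search(r"Elapsed time ([\d.]+) seconds", log_text)
    if match:
        seconds = float(match.group(1))
    return {"steps": steps, "jobs": jobs, "seconds": seconds}


def format_duration(seconds):
    """Render a duration in seconds as 'Hh MMm SSs' (fractional part dropped)."""
    hours, remainder = divmod(int(seconds), 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours}h {minutes:02d}m {secs:02d}s"


example_log = """\
[#####################] 21 steps processed (183 jobs completed)
INFO: Elapsed time 6456.271 seconds.
"""
summary = summarize_dsc_log(example_log)
print(summary["steps"], summary["jobs"], format_duration(summary["seconds"]))
# prints: 21 183 1h 47m 36s
```

For the toy run this gives just under two hours with 8 cores (-c 8).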

Actual commands used for the SuSiE paper

The DSC computations were performed on the University of Chicago RCC Midway cluster. The *.yml files contain cluster settings that assign resources to the different computational modules involved.

To compare with other methods on the smaller simulation scenarios, with $S = 1,2,3,4,5$ effect variables on smaller data-sets (1000 variables):

dsc susie.dsc --target run_comparison -o susie_comparison --host susie_comparison.yml -c 60

To compare with DAP on $S = 10$ effect variables on the full range of genotype data (~8K variables on average):

dsc susie.dsc --target hard_case -o hard_case --host hard_case.yml -c 60

It is also possible to run the benchmark on a local computer by dropping the --host option. The *.yml files under the src folder remain relevant, as they document the resources required by each DSC module. The running time of the benchmark depends on the available computational resources. We estimate that the entire benchmark takes over a week to run on a single compute node with 28 Intel Xeon E5 CPU threads (some of the other fine-mapping methods are the speed bottleneck).
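The cluster and local invocations differ only in whether --host is passed. A minimal sketch of assembling either command from the flags shown above follows. The helper build_dsc_command is hypothetical, and the value -c 28 for the local run is an assumption based on the 28-thread node mentioned in the text, not a documented setting.

```python
import shlex


def build_dsc_command(target, output, cores, host=None, script="susie.dsc"):
    """Assemble a dsc command line as a list of arguments.

    Uses only flags that appear in this document: --target, -o, --host, -c.
    Omitting host yields the local-computer variant.
    """
    cmd = ["dsc", script, "--target", target, "-o", output]
    if host is not None:
        cmd += ["--host", host]
    cmd += ["-c", str(cores)]
    return cmd


# Cluster run, identical to the run_comparison command given earlier.
cluster = build_dsc_command(
    "run_comparison", "susie_comparison", 60, host="susie_comparison.yml"
)
print(shlex.join(cluster))

# Local run: same target without --host (28 cores is an assumed value).
local = build_dsc_command("run_comparison", "susie_comparison", 28)
print(shlex.join(local))
```

The returned list can be passed directly to subprocess.run if the benchmark is launched from a script.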

Input genotype data

Please check out this pipeline for information on how input genotype data is prepared.

Benchmark data download

The benchmark data can be downloaded here as a 48GB tarball that includes all intermediate results generated by the benchmark (~8GB from the comparison with CAVIAR, DAP-G and FINEMAP on 1000 variables, and ~40GB from the comparison with DAP-G on 3,000 to 12,000 variables). This data-set can be used to reproduce the numerical studies of the manuscript as documented here.


© 2017-2018 authored by Gao Wang at Stephens Lab, The University of Chicago