DSC syntax: basics of benchmark execution
This document is a continuation of the discussion in DSC syntax: basics of modules. We will use toy DSC examples (including breakdowns of these toys) to introduce a special syntax block called DSC that defines module ensembles, pipelines and the DSC benchmark, as well as the execution environment. This block has the reserved property keyword DSC and is required in all DSC scripts. In this document we refer to its contents by DSC::<key>, where <key> is the identifier on the left side of : in each line of the DSC block.
Benchmark logic operators
In building a DSC benchmark, two types of logic are involved: “alternating” modules, or a module ensemble, refers to a group of modules that achieve the same goal (each able to take similar input and provide the same type of output) as if they were interchangeable; “concatenating” modules, or a pipeline, refers to a sequence of modules that execute one after another, as if concatenated into a chain of actions. These two types of logic are implemented via the *, , and () operators:
- *: connects concatenating modules, ie, the right hand side (RHS) is executed after the left hand side (LHS).
- ,: connects alternating modules, ie, the RHS and LHS are interchangeable.
- (): re-defines operator precedence when alternating and concatenating modules are combined. By default, * has higher precedence than ,.
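For example, with hypothetical modules a, b and c (the comments after # are ours, for illustration), a sketch of how parentheses change what gets expanded:

a * b, c      # 2 pipelines: "a then b", and "c" by itself
a * (b, c)    # 2 pipelines: "a then b", and "a then c"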
Module ensembles and pipelines
DSC::define defines module ensembles and pipelines, that is, groups of “alternating” or sequences of “concatenating” modules / ensembles. For example:
normal, t: rnorm.R, rt.R
  ...

mean, median: mean.R, median.R
  ...

DSC:
  define:
    simulate: normal, t
    estimate: mean, median
Here 2 ensembles are defined: simulate, consisting of alternating modules normal and t for data generation; and estimate, consisting of alternating modules mean and median for parameter estimation.
To illustrate pipelines and pipeline ensembles, we can add an additional winsorize module to estimate, creating a pipeline ensemble called winsorized_estimate:
DSC:
  define:
    simulate: normal, t
    winsorized_estimate: winsorize * (mean, median)
The winsorized_estimate pipeline ensemble contains 2 pipelines: winsorize * mean and winsorize * median. They are not full pipelines because they depend on input from simulate. A DSC benchmark, DSC::run, requires one or multiple full pipelines – see below for detailed discussion.
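For instance, a minimal sketch of a complete benchmark built on this ensemble (assuming a score module exists to evaluate the estimates) could be:

DSC:
  define:
    simulate: normal, t
    winsorized_estimate: winsorize * (mean, median)
  run: simulate * winsorized_estimate * score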
DSC benchmark
A DSC benchmark, defined by DSC::run, consists of one or multiple complete pipelines. We will use a series of examples to illustrate how benchmarks are created.
Example 1: styles of pipeline specification
run: simulate * estimate * score
Here simulate and estimate are module ensembles (see DSC::define in the previous section). This is equivalent to writing:
run: (normal, t) * (mean, median) * score
Example 2: concatenating module and pipeline ensembles
run: simulate * (winsorize * mean, mean) * score
Here simulate is a module ensemble, (winsorize * mean, mean) is a pipeline ensemble and score is a module. DSC will run 2 flavors of pipelines: 1) one that applies winsorize followed by mean, and 2) one that applies mean only.
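Since simulate is itself the ensemble (normal, t), expanding the ensembles shows this is equivalent to 4 pipelines in total:

run: normal * winsorize * mean * score,
     normal * mean * score,
     t * winsorize * mean * score,
     t * mean * score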
Example 3: expansion of pipeline operators
run: simulate * (voom, sqrt, identity) * (RUV, SVA) * (DESeq, edgeRglm, ash) * score
will be expanded to:
run: simulate * sqrt * RUV * DESeq * score,
simulate * sqrt * SVA * DESeq * score,
simulate * identity * RUV * DESeq * score,
simulate * voom * RUV * DESeq * score,
simulate * identity * SVA * DESeq * score,
simulate * voom * SVA * DESeq * score,
simulate * sqrt * RUV * ash * score,
simulate * sqrt * RUV * edgeRglm * score,
simulate * sqrt * SVA * ash * score,
simulate * sqrt * SVA * edgeRglm * score,
simulate * identity * RUV * ash * score,
simulate * voom * RUV * ash * score,
simulate * identity * RUV * edgeRglm * score,
simulate * voom * RUV * edgeRglm * score,
simulate * identity * SVA * ash * score,
simulate * voom * SVA * ash * score,
simulate * identity * SVA * edgeRglm * score,
simulate * voom * SVA * edgeRglm * score
Example 4: named pipelines
Pipelines in a benchmark can have names, for example:
run:
  pipeline_1: simulate * sqrt * RUV * DESeq * score
  pipeline_2: simulate * sqrt * SVA * DESeq * score
Here pipeline_1 and pipeline_2 are the names of pipelines in the benchmark. They are only relevant to command-line execution (see the section below). When a pipeline named default is found, eg,
run:
  default: ...
  other_names: ...
then executing the script without additional parameters will only execute the default pipeline. Otherwise, all pipelines will be executed by default, just as in the case of unnamed pipelines.
Command interface to DSC benchmark
By default, DSC will execute all pipelines defined in DSC::run, but one can override this from the command line. For example, to execute the complete benchmark of Example 1:
dsc file.dsc --target "simulate * estimate * score"
or a subset of Example 1:
dsc file.dsc --target "normal * estimate * score"
You can also execute a single module / ensemble for debugging purposes. The commands below are equivalent:
dsc file.dsc --target normal t
dsc file.dsc --target simulate
To execute one of the two pipelines in Example 4:
dsc file.dsc --target pipeline_1
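Assuming --target accepts multiple space-separated targets, as shown above for modules, both named pipelines can presumably be selected in a single call:

dsc file.dsc --target pipeline_1 pipeline_2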
Benchmark parameters
These parameters are specified in the rest of the DSC block:
- exec_path: search path (or paths) for executable files.
- lib_path: search path (or paths) for library files, such as R scripts to be source-ed or Python scripts to be import-ed from.
- R_libs: check required R libraries and, if not available, automatically install them from CRAN, Bioconductor or GitHub. 3 formats are accepted:
  - <package>: DSC will check / install package from CRAN or Bioconductor.
  - <package@github_repo/subdir>: DSC will install package from github_repo on GitHub. /subdir is optional – only applicable when the package is not in the root of the repository, eg ashr@stephenslab/ashr, varbvs@pcarbo/varbvs/varbvs-r.
  - <package (version[s])>: DSC will install package and check its version as required. A warning message will be generated if versions do not match even after forced installation of the latest. It is possible to specify several versions, such as edgeR (3.12.0, 3.12.1), or simply edgeR (3.12.0+).
- python_modules: check required Python 3 modules, and install them from pip if not available.
- container: docker container image to use, eg gaow/susie-paper:latest.
- container_engine: optionally specify the engine to run containers; can be either docker or singularity. Default is docker.
- seed: what to use for random number generator seed setting; can be either DEFAULT (use a seed based on both the replicate ID and the module instance ID) or REPLICATE (use the DSC replicate ID as seed). Default is DEFAULT.
- work_dir: work directory where benchmarking will take place.
- output: folder name to store benchmark output files.
- global: defines global variables which can be used in the definition of module parameters through the variable wildcard $().
- replicate: number of replicates to run for the benchmark (default is 1).
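For instance, a minimal sketch of a DSC block combining several of these keys (the paths, package versions and replicate count here are hypothetical):

DSC:
  define:
    simulate: normal, t
    estimate: mean, median
  run: simulate * estimate * score
  exec_path: bin
  lib_path: lib
  R_libs: ashr@stephenslab/ashr, edgeR (3.12.0+)
  output: my_benchmark
  replicate: 10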
Override runtime options in modules with @CONF
Among these DSC runtime options, the following can be customized on a per-module basis:
exec_path, lib_path, work_dir, R_libs, python_modules
The syntax is similar to that of @ALIAS. That is, values are assigned by =:
simulate: simulate.R
  ...
  @CONF: exec_path = simulate_bin, lib_path = simulate_lib
and can be module specific:
normal, t: simulate.R
  ...
  @CONF:
    normal: exec_path = normal_bin
    t: exec_path = t_bin
Multiple paths need to be specified as a tuple, eg:
simulate: simulate.R
  ...
  @CONF: lib_path = (simulate_lib, main_lib), R_libs = (ashr@stephenslab/ashr 2.2.7+, mashr 0.2.6+)
@CONF is designed not only to “fine-tune” the runtime environment of a module, but also, as a “decorator”, to change the behavior of a module’s execution. Specifically:
- lib_path: when specified under @CONF, all relevant scripts (of the same type as the executable, eg R or Python) will be tracked. Changes to their content will result in re-execution of all instances of this module when running the benchmark.
- R_libs: when specified under @CONF, all listed libraries will be loaded automatically via library(...) prior to running the module script. Please use DSC::R_libs instead if this is not the desired behavior. The same holds for python_modules.
- seed: can be used to configure module-specific behavior; for example, seed = REPLICATE will use the DSC replicate ID as the seed for the module. This is useful when consistent behavior among module instances is desired: for example, for different parameter inputs to a statistical method (creating different module instances), we may want to use the same seed for all these parameter inputs (see the sketch after this list).
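A minimal sketch of module-specific seed configuration (the estimate module and its script are hypothetical):

estimate: estimate.R
  ...
  @CONF: seed = REPLICATE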
Run DSC on an HPC cluster or other remote computers
See this documentation for a discussion.