DSC syntax: basics of benchmark execution
This document is a continuation of the discussion in DSC syntax: basics of modules. We will use toy DSC examples (including breaking down this toy) to introduce a special syntax block, `DSC`, that defines module ensembles, pipelines and the DSC benchmark, as well as the execution environment. This block has the reserved property keyword `DSC` and is required for all DSC scripts. In this document we refer to its contents as `DSC::<key>`, where `<key>` is the identifier on the left side of `:` in each line of the `DSC` block.
Benchmark logic operators
In building a DSC benchmark, two kinds of logic are involved: “alternating” modules, or a module ensemble, refers to a group of modules that achieve the same goal (able to take similar input and provide the same type of output) as if they were exchangeable; “concatenating” modules, or a pipeline, refers to a sequence of modules that execute one after another, as if concatenated into a chain of actions. These logics are implemented via the `*`, `,` and `()` operators:

- `*`: connects concatenating modules, i.e., the right-hand side (RHS) is executed after the left-hand side (LHS).
- `,`: connects alternating modules, i.e., RHS and LHS are exchangeable.
- `()`: when alternating and concatenating modules are combined, parentheses redefine operator precedence. By default `*` has higher precedence than `,`.
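As a quick illustration of precedence (the module names `A`, `B` and `C` here are hypothetical):

```
A * B, C      # default precedence: two pipelines, "A * B" and "C"
A * (B, C)    # parentheses override: two pipelines, "A * B" and "A * C"
```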
Module ensembles and pipelines
`DSC::define` defines module ensembles and pipelines, that is, groups of “alternating” or sequences of “concatenating” modules / ensembles. For example:

```
normal, t: rnorm.R, rt.R
  ...

mean, median: mean.R, median.R
  ...

DSC:
  define:
    simulate: normal, t
    estimate: mean, median
```

Here 2 ensembles are defined: `simulate`, of alternating modules `normal` and `t` for data generation, and `estimate`, of alternating modules `mean` and `median` for parameter estimation.
To illustrate pipelines and pipeline ensembles, we can add an additional `winsorize` module to `estimate`, creating a pipeline ensemble called `winsorized_estimate`:

```
DSC:
  define:
    simulate: normal, t
    winsorized_estimate: winsorize * (mean, median)
```

The `winsorized_estimate` pipeline ensemble contains 2 pipelines, `winsorize * mean` and `winsorize * median`. They are not full pipelines because they depend on input from `simulate`. A DSC benchmark, `DSC::run`, requires one or more full pipelines; see below for detailed discussion.
DSC benchmark
A DSC benchmark, defined by `DSC::run`, consists of one or more complete pipelines. We will use a series of examples to illustrate how benchmarks are created.
Example 1: styles of pipeline specification

```
run: simulate * estimate * score
```

Here `simulate` and `estimate` are module ensembles (see `DSC::define` in the previous section). This is equivalent to writing:

```
run: (normal, t) * (mean, median) * score
```
Example 2: concatenating modules and pipeline ensembles

```
run: simulate * (winsorize * mean, mean) * score
```

Here `simulate` is a module ensemble, `(winsorize * mean, mean)` is a pipeline ensemble and `score` is a module. DSC will run 2 flavors of pipelines: 1) `winsorize` followed by `mean`, and 2) `mean` only.
Example 3: expansion of pipeline operators

```
run: simulate * (voom, sqrt, identity) * (RUV, SVA) * (DESeq, edgeRglm, ash) * score
```

will be expanded to:

```
run: simulate * sqrt * RUV * DESeq * score,
     simulate * sqrt * SVA * DESeq * score,
     simulate * identity * RUV * DESeq * score,
     simulate * voom * RUV * DESeq * score,
     simulate * identity * SVA * DESeq * score,
     simulate * voom * SVA * DESeq * score,
     simulate * sqrt * RUV * ash * score,
     simulate * sqrt * RUV * edgeRglm * score,
     simulate * sqrt * SVA * ash * score,
     simulate * sqrt * SVA * edgeRglm * score,
     simulate * identity * RUV * ash * score,
     simulate * voom * RUV * ash * score,
     simulate * identity * RUV * edgeRglm * score,
     simulate * voom * RUV * edgeRglm * score,
     simulate * identity * SVA * ash * score,
     simulate * voom * SVA * ash * score,
     simulate * identity * SVA * edgeRglm * score,
     simulate * voom * SVA * edgeRglm * score
```
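The expansion above is simply the Cartesian product of the alternatives at each position in the pipeline. A minimal sketch of this logic in Python (this is an illustration of the expansion rule, not DSC's actual implementation):

```python
from itertools import product

# Alternatives at each pipeline position, as in Example 3.
transforms = ["voom", "sqrt", "identity"]
corrections = ["RUV", "SVA"]
methods = ["DESeq", "edgeRglm", "ash"]

# Each expanded pipeline concatenates one choice from each group with "*".
pipelines = [
    " * ".join(["simulate", t, c, m, "score"])
    for t, c, m in product(transforms, corrections, methods)
]

print(len(pipelines))  # 3 * 2 * 3 = 18 pipelines
print(pipelines[0])    # simulate * voom * RUV * DESeq * score
```

With 3, 2 and 3 alternatives in the three groups, the benchmark expands to 3 × 2 × 3 = 18 pipelines, matching the list above.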
Example 4: named pipelines

Pipelines in a benchmark can have names, for example:

```
run:
  pipeline_1: simulate * sqrt * RUV * DESeq * score
  pipeline_2: simulate * sqrt * SVA * DESeq * score
```

Here `pipeline_1` and `pipeline_2` are the names of pipelines in the benchmark. They are only relevant to command-line execution (see the section below). When a pipeline named `default` is found, e.g.,

```
run:
  default: ...
  other_names: ...
```

then executing the script without additional parameters will only execute the `default` pipeline. Otherwise, all pipelines will be executed by default, just as in the unnamed case.
Command interface to DSC benchmark
By default, DSC will execute all pipelines defined in `DSC::run`, but one can override pipelines from the command line. For example, to execute the complete Example 1:

```
dsc file.dsc --target "simulate * estimate * score"
```

or a subset of Example 1:

```
dsc file.dsc --target "normal * estimate * score"
```

You can also execute a single module / ensemble for debugging purposes. The commands below are equivalent:

```
dsc file.dsc --target normal t
dsc file.dsc --target simulate
```

To execute one of the two pipelines in Example 4:

```
dsc file.dsc --target pipeline_1
```
Benchmark parameters
These parameters are specified in the rest of the `DSC` block:

- `exec_path`: search path (or paths) for executable files.
- `lib_path`: search path (or paths) for library files, such as R scripts to be `source`-ed or Python scripts to be `import`-ed from.
- `R_libs`: check required R libraries and, if not available, automatically install them from CRAN, Bioconductor or GitHub. 3 formats are accepted:
  - `<package>`: DSC will check / install `package` from CRAN or Bioconductor.
  - `<package@github_repo/subdir>`: DSC will install `package` from `github_repo` on GitHub. `/subdir` is optional, applicable only when the package is not in the root of the repository, e.g. `ashr@stephenslab/ashr`, `varbvs@pcarbo/varbvs/varbvs-r`.
  - `<package (version[s])>`: DSC will install `package` and check its version as required. A warning message is generated if the version does not match after forced installation of the latest release. It is possible to specify several versions, such as `edgeR (3.12.0, 3.12.1)`, or simply `edgeR (3.12.0+)`.
- `python_modules`: check required Python3 modules, and install them from pip if not available.
- `container`: docker container image to use, e.g., `gaow/susie-paper:latest`.
- `container_engine`: optionally specify the engine to run containers; can be either `docker` or `singularity`. Default is `docker`.
- `seed`: what to use for random number generator seed setting; can be either `DEFAULT` (use a seed based on both replicate ID and module instance ID) or `REPLICATE` (use the DSC replicate ID as seed). Default is `DEFAULT`.
- `work_dir`: working directory where benchmarking will take place.
- `output`: folder name to store benchmark output files.
- `global`: defines global variables which can be used in definitions of module parameters through the variable wildcard `$()`.
- `replicate`: number of replicates to run the benchmark (default is 1).
Override runtime options in modules with @CONF
Among these DSC runtime options, the following can be customized on a per-module basis: `exec_path`, `lib_path`, `work_dir`, `R_libs`, `python_modules`.

The syntax is similar to that of `@ALIAS`; that is, values are assigned by `=`:

```
simulate: simulate.R
  ...
  @CONF: exec_path = simulate_bin, lib_path = simulate_lib
```

and can be module specific:

```
normal, t: simulate.R
  ...
  @CONF:
    normal: exec_path = normal_bin
    t: exec_path = t_bin
```

Multiple paths need to be specified as a tuple, e.g.,

```
simulate: simulate.R
  ...
  @CONF: lib_path = (simulate_lib, main_lib), R_libs = (ashr@stephenslab/ashr 2.2.7+, mashr 0.2.6+)
```
`@CONF` is designed not only to “fine-tune” the runtime environment of a module but also, as a “decorator”, to change the behavior of a module's execution. Specifically:

- `lib_path`: when specified under `@CONF`, all relevant scripts (of the same type as the executable, e.g. `R` or `Py`) will be tracked. Changes to their content will trigger re-execution of all instances of this module when the benchmark is run.
- `R_libs`: when specified under `@CONF`, all listed libraries are automatically loaded via `library(...)` before the module script runs. Please use `DSC::R_libs` instead if this is not the desired behavior. The same holds for `python_modules`.
- `seed`: can be used to configure module-specific behavior; for example, `seed = REPLICATE` will use the DSC replicate ID as the seed for the module. This is useful when consistent behavior among module instances is desired, for example when different parameter inputs to a statistical method (each creating a different module instance) should all use the same seed.
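For instance, a hypothetical module that pins its seed to the replicate ID so that every parameter setting draws the same random data (the module name `shrink` and the `lambda` values are made up for illustration):

```
shrink: shrink.R
  lambda: 0.1, 0.5, 1.0
  @CONF: seed = REPLICATE
```

With this setting, all three module instances of `shrink` (one per `lambda` value) within a given replicate use the same replicate-based seed.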
Run DSC on HPC clusters or other remote computers
See this documentation for a discussion.