DSC syntax: basics of modules
In this document we mainly use partial DSC examples (along the lines of this toy but with advanced features not used there) to introduce basic DSC syntax. Before diving into the details, you may find DSC introduction tutorials a useful first-read to help better understand DSC syntax.
We endeavor to maintain DSC syntax simple and intuitive. This documentation covers the basic DSC syntax for users to get started quickly. We will use dedicated tutorial examples on various specific user cases separatedly, such as using other languages / command tools, using DSC in exploratory phase of research projects, and working with remote computers or HPC clusters for large scale computation. (FIXME: create these tutorials and add links to them)
A DSC file consists of one or more syntax blocks to configure modules, and one block to configure pipelines and benchmark. Here we focus on discussing modules. DSC pipelines and benchmark execution will be discussed in a separate document.
Modules
A module block is conceptually consisted of:
property
input
parameters
output
and, optionally, decorators
that provide customization to flavors of modules. Note that we do not use these exact terms as keywords in module syntax, yet as will be illustrated soon, these components distinguish from each other syntactically.
Module property
Modules, the building blocks of DSC, can be minimally specified as follows:
normal, t: rnorm.R, rt.R
n: 1000
$x: x
Here we focus on the first line (aka the “module property” line):
normal, t: rnorm.R, rt.R
The two elements separated by :
are module names and executables. The module named normal
corresponds to rnorm.R
, t
corresponds to rt.R
. For cases when there are multiple module names on the left of :
yet only one executable on the right, eg:
normal, t: simulate.R
then the two modules will share the same executable (yet, to be discussed later, possibly different module parameters).
Compound module executables
Multiple scripts can be connected with +
for one module to allow them be loaded sequentially as if they are concatenated to one script. For example:
method: utils.R + main.R
will load utils.R
followed by main.R
.
Tuple operator ()
can be used in conjuction with +
to expand the logic of concatenation, as a shorthand to define multiple compound executables. For example:
method1, method2: (utils1.R, utils2.R) + main.R
will load utils1.R
followed by main.R
for module method1
, utils2.R
followed by main.R
for module method2
.
Inline executables
Instead of having to use scripts to specify module executables, language interpreter operators can be used to define codes for modules inline. For example:
normal: R(x = rnorm(n))
n: 1000
$x: x
It can also be combined with scripts as a concatenation:
method: utils.R + R(res = main(...))
Or,
method1, method2: (utils1.R, utils2.R) + R(res = main(...))
Module Inheritance
Modules can be inherited as follows, a shorthand to define new modules based on existing ones:
normal: R( x = mu + rnorm(n) )
n: 100, 1000
mu: 0
$data: x
$true_mean: mu
shifted_normal(normal):
mu: 1
shifted_normal
is a derived module from normal
. The complete definition of shifted_normal
, after expanding the derived contents, is in fact:
shifted_normal: R( x = mu + rnorm(n) )
n: 100, 1000
mu: 1
$data: x
$true_mean: mu
The inheritance syntax makes module definitions not only succint but also conceptually more clear to define new modules – in this example we can tell that shifted_normal
is essentially normal
with different parameters.
Module inheritance can also be used to add parameters, eg,
normal: R( x = mu + rnorm(n) )
n: 100, 1000
mu: 0
$data: x
$true_mean: mu
t(normal): R( x = mu + rt(n,df) )
df: 2
Then t
is inherited from normal
with a different script and additional parameter, which will get expanded to:
t: R( x = mu + rt(n,df) )
n: 100, 1000
mu: 0
df: 2
$data: x
$true_mean: mu
It is possible to create “base” modules that are not meant to be used directly, but exclusively for other modules to derive upon. Such base modules are characterized by the absence of module executable specification, eg:
simulate_base:
n: 100, 1000
mu: 0
$data: x
$true_mean: mu
normal(simulate_base): R( x = mu + rnorm(n) )
t(simulate_base): R( x = mu + rt(n,df) )
df: 2
Module parameters
Each module parameter is an indented property under module property line. In the example above, n
is module input parameter for both modules normal
and t
, corresponding to the variable n
in both rnorm.R and rt.R scripts. Here n
is set to 1000
. When set to an array-like value, eg n: 100, 500, 1000
then effectively 3 modules are defined for normal
and t
each. Take normal
for example:
- module 1:
rnorm.R
withn
set to 100 - module 2:
rnorm.R
withn
set to 500 - module 3:
rnorm.R
withn
set to 1000
It is also possible to specify a parameter with groups of values, for example, p: (0.1, 0.9), (0.2, 0.8)
will initialize 2 modules each with parameter p
of length 2.
When multiple parameters are specified, eg:
n: 10, 20
p: 0.1, 0.2
Combinations of parameters (Cartesian product style by default) will be assigned to all modules unless otherwise instructed via @FILTER
decorator. In the example above each module will take 4 sets of parameters : (n = 10, p = 0.1), (n = 10, p = 0.2), (n = 20, p = 0.1), (n = 20, p = 0.2)
.
To get “paired” parameters instead, the Tuple operator can be used on both sides of the property. For example,
(n, p): (10, 0.1), (20, 0.2)
results in (n = 10, p = 0.1), (n = 20, p = 0.2)
.
Caution that module parameter names must match with variable names, or command-line arguments, for module executables. For example, module parameter n
is specified at DSC level and passed to rnorm.R and rt.R – neither of these executables has definition for n
because they rely on DSC to provide it. It is the user’s responsibility to ensure that they match.
Module output
Module output are pipeline variables that can be accessed by other downstream modules in the pipeline. These variables have a leading $
symbol followed by variable names. For example:
normal, t: rnorm.R, rt.R
n: 1000
$x: x
$x
on the left hand side of :
is a module output. Under the hood, the rnorm.R
script generates a variable x
, which is an n
vector of normally distributed numbers. By specifying $x: x
, x
becomes a “module output”. That is, other downstream steps will be able to use it via $x
as their input, as will be discussed later in detail.
Language specific syntax can be used to extract specific data from objects for use as module output. For example $beta: data$meta_info$beta
extracts beta
from an R
nested list and save it directly as module output beta
.
For modules whose executables are R or Python scripts, a module output can match any arbitary variable name inside the script, thus may or may not appear in module parameters. In the above example, x
is not a module parameter. For command-line executables module output will have to be files (see file()
operator below).
Module input
Module input are pipeline variables generated by other upstream modules in the pipeline. Same as module output, module input has the pipeline variable syntax $
followed by variable names. What distinguishes module input and output syntactically is whether they appear on the left side or the right side of :
. For example:
normal, t: rnorm.R, rt.R
n: 1000
$x: x
shrink: method.R
x: $x
Here in the shrink
module, $x
is a module input, whose value will be assigned to a variable x
that will be used in method.R
script. The runtime environment determines which module this variable comes from. For example if the pipeline is normal -> shrink
then module input $x
of shrink
is output $x
of normal
module.
Decorators
The word “decorator” is borrowed from the term “decorator pattern” in object-oriented programming (a design pattern that allows behavior to be added to an individual object without affecting the behavior of other objects from the same class). A DSC decorator, when applied, will modify behavior of modules specified without impacting other modules in the same syntax block. Available decorators are:
- @ < module >
- @ALIAS
- @FILTER
- @CONF
@<module>
is specifically used to adjust settings for one module, when multiple modules are specified in the same code block. The other 3 decorators can also be used even when only single module is present in a code block – this typically applies to derived modules whose behavior is adjusted by decorators to distinguish from the module they inherited.
Module as decorator
Module decorator sepcifies parameters or inputs unique to one module but not applicable to others. for example
normal, t: rnorm.R, rt.R
n: 1000
@t:
df: 5, 10
...
where n
is shared by both modules but parameter df
is only used by module t
. Module decorator always took precedence over parameters / inputs shared by modules when assigning new variables or modify existing ones, that is:
normal, t: rnorm.R, rt.R
n: 1000
@t:
n: 200
...
will set n = 200
for t
, rather than using n = 1000
.
Bulk syntax is supported:
normal, t, cauchy: rnorm.R, rt.R, rcauchy.R
n: 1000
@t,cauchy:
n: 200
Caution that module decorators cannot be used to configure module output variables, because module output are required to be the same for all modules specified in the same syntax block.
Decorator ALIAS
@ALIAS
is used to adjust the way inputs and parameters are passed to modules. DSC module parameter are of “simple” types (single elements or an array of single elements). In practice parameters specification may get complicated. @ALIAS
can be used to compose a commonly used parameter input theme: all parameters are nested inside a “key-value” system.
For example an R script method.R
accepts one “List” variable args
that contains var_a
and var_b
:
f1 = function(x) { ... }
f2 = function(x) { ... }
result = f1(args$var_a) / f2(args$var_b)
Then instead of modifying method.R
to take “simple” parameters, we can use @ALIAS
,
method: method.R
a: 1
b: 2
@ALIAS: args = List(var_a = a, var_b = b)
such that a
and b
are properly consolidated into an R List args
as var_a
and var_b
respectively. Currently we allow for List()
and Dict()
for R’s List and Python’s Dictionary though more object types may be supported as needed in future versions.
ALIAS operators
ALIAS operators, List()
and Dict()
are used in @ALIAS
decorator to help construct variables of nested List
or dictionary
structures in R
and Python
. For example for R
modules:
@ALIAS: args = List()
will convert all input / parameters the module has available, say x,y,z
to args <- list(x = ..., y = ..., z = ...)
. Likewise, List()
will convert parameters to dictionary
in Python. Partial conversion is also supported, for example args = List(x, y)
will only convert selected variables to R list which will be translated to R code args <- list(x = x, y = y)
.
Note that when a variable is made into ALIAS operator it will no longer be available as is. For example,
n: 1
m: 2
@ALIAS: args = List()
Then you cannot use m
or n
in the module script. Rather they become args$m
and args$n
. To exclude variables from being included,
n: 1
...
...
@ALIAS: args = List(!n)
Then for an R module every other parameters will be consolidated into R list args
, except for parameter n
. Multiple !
is supported, eg, List(!n, !m)
.
Another important usage of @ALIAS
is to adjust parameter names for the module specified. For example, suppose the variable n
in script rnorm.R
is n
but in rt.R
it is n_samples
, then we can use @ALIAS
to adjust module t
:
normal, t: rnorm.R, rt.R
n: 1000
@ALIAS:
t: n_samples = n
Then n
is renamed to n_samples
for input to module t
.
When both module specific @ALIAS
and global @ALIAS
are to be specified, one can either use module decorator in combination with @ALIAS
, or use wildcard *
in addition to module specific adjustment. For example the following two code blocks are equivalent:
normal, t, rcauchy: rnorm.R, rt.R, rcauchy.R
n: 1000
mu: 5
@ALIAS:
*: args = List(mu)
t: args = List(mu), n_samples = n
normal, t, rcauchy: rnorm.R, rt.R, rcauchy.R
n: 1000
mu: 5
@ALIAS: args = List(mu)
@t:
@ALIAS: args = List(mu), n_samples = n
Decorator FILTER
@FILTER
can be used to modify default logic that all parameters are combined the Cartesian product style. The @FILTER
syntax is the same as the condition
statement in dsc-query
command; both documented elsewhere. Here we elaborate on some examples:
normal, t: normal.R, t.R
n: 100, 200, 300, 400, 500
k: 0, 1
@FILTER:
t: (n <= 300 and k = 0) or (n > 300 and k = 1)
normal: n = 500
Without @FILTER
DSC will exhaust all combinations of 5 values of n
and 2 of k
for both modules, creating a total of 20 parallel jobs. The @FILTER
here states that instead of running all 20 jobs, DSC will run for t
3 values from n
with k = 0
, then run the rest 2 values of n
with k = 1
, which is a total of 5 jobs for t
. Then it runs for normal
with n = 500
combined with 2 values from k
, a total of 2 jobs. Therefore only 7 jobs will be executed after applying the @FILTER
.
When no module name is specified, eg. n
, then all modules will subject to the filter:
@FILTER: (n in [100,200,300] and k = 0)
Wildcard *
is also supported,
@FILTER:
*: n in [100,200,300]
t: n = 300
then for t
we restrict n
to 300
, and for all other modules we set n
to loop over 100,200,300
.
Decorator CONF
@CONF
provides interface to override certain properties in benchmark configuration section (DSC
section), including work_dir
, exec_path
, lib_path
, etc. See this document for more details.
Operators
Tuple operator
When used to specify module parameters, tuple operator ()
simply creates grouped values as module parameters, for example
method: method.R
K: (1,2,3), (4,5,6)
With ()
, (1,2,3)
will be translated as a group to c(1, 2, 3)
in R
, (1,2,3)
in Python
, or space separated argument sequence 1 2 3
for command-line tools. Values will be assigned in groups defined by ()
instead of separately.
When used to define module executables, in conjunction with concatenating operator +
it helps matching compound executables to each module in the same block.
Language interpreters
Currently R()/R{}
, Python()/Python{}
, and Shell()/Shell{}
are supported to specify module parameters. Codes inside these operators will be interpreted while DSC parses the configuration file; output will be evaluated as parameter values. For example n: R((1:5)*10)
results in n: (10, 20, 30, 40, 50)
and n: R{(1:5)*10}
results in n: 10, 20, 30, 40, 50
. This provides handy tool for generating input parameters with R, Python or Shell languages.
R()
and Python()
can also be used to create inline module executables. They will be interpreted at runtime as part of the module executable scripts.
Raw string indicator
In cases when we want to input a chunk of scripts as un-quoted string that may conflict with DSC parser behavior, raw()
indicator can be used to “protect” the input content from being processed by DSC, for example:
g: [1,2,3]
will result in the Python
code
g = (1,2,3)
because vector-like atomic types are converted to python “tuples” from DSC interface. But with raw()
it will result in R
code
g = [1,2,3]
as initially specified.
Another example:
normal: R(x <- rnorm(n,mean = mu,sd = 1))
mu: raw(0.0)
will result in
x <- rnorm(n,mean = 0.0,sd = 1)
However replacing raw(0.0)
with 0.0
it will result in
x <- rnorm(n,mean = 0,sd = 1)
Caution that whatever goes into raw()
will be treated as strings in the condition
parameter of dscrutils::dscquery
.
File generator
FIXME: multiple output file generator objects are not yet allowed, and will not be available until DSC 0.3.x. Convention of module output file will therefore subject to change.
When a DSC module involves use of files yet to be created by a module instance, file()
can be used to ensure a file is properly generated without racing with other module instances; or simply put, to generate unique file name strings per module instance.
File generator file()
will generate unique files by the following conventions:
- When used as module parameter and without an extension it generates a temporary file, eg,
cache: file()
. - When used as module parameter with an extension it generates files with unique names into DSC output folder. eg,
cache: file(txt)
or equivalentlycache: file(.txt)
. - When used as module output with an extension it generates files with unique name as the module output file.
Before explaining in detail, here is a couple of simple DSC modules to demonstrate the usage:
t1: R(print(paste("Parameter 'file1', an un-saved tmp file: ", file1));
print(paste("Parameter 'file2', a saved tmp file: ", file2));
print(paste("Module output 'out', a saved and tracked file: ", out));
write(0,out))
file1: file()
file2: file(txt)
$out: file(txt)
t2: R(print(paste("Module input 'data' is previous module output 'out': ", data));
out = read.table(data))
data: $out
$out: out
Then run:
$ dsc test.dsc --target "t1*t2"
$ find test -name "*.stdout" -exec tail -vn +1 {} \;
==> test/t2/t1_1_t2_1.stdout <==
[1] "Module input 'data' is previous module output 'out': test/t1/t1_1.txt"
==> test/t1/t1_1.stdout <==
[1] "Parameter 'file1', an un-saved tmp file: /tmp/RtmpxetZNc/t1_1.file1"
[1] "Parameter 'file2', a saved tmp file: test/t1/t1_1.file2.txt"
[1] "Module output 'out', a saved and tracked file: test/t1/t1_1.txt"
you see 3 types of file names printed to screen, corresponding to the 3 conventions previously introduced:
/tmp/RtmpxetZNc/t1_1.file1
test/t1/t1_1.file2.txt
test/t1/t1_1.txt
Extension not specified
When using file()
without specifing the extension, it generates a file path in the system temporary folder and can be used as a “temp file”. It will not appear in users’ work directory, and as with other system temp files there is generally no need to worry about their management / maintenance. Itis useful when there is a need to create some intermediate file for a module to run, yet there is no need to keep it for trouble-shooting. In the example above, parameter file1
is a temporary file.
Extension specified
file()
with extension will result in a unique and informative basename for use in a module instance, and will result in files under the DSC output directory. When used as a module parameter, intermediate file cache are generated and saved, but will not be tracked by DSC or used in other modules. This means deleting this file manually afterwards will not trigger rerun of the pipelines involved. When used as a module output, DSC will keep track of it as pipeline variable that other modules can use. For example:
t1: R(print(cache); write(out, 0))
n: 1, 2, 3, 4
cache: file(cache)
$out: file(txt)
Here the output out
is a file name string, generated by the module taking parameter n
in 4 different values. This will result in 4 module instance each with an output. Names of these out
files will be automatically assigned for use with the downstream pipelines.
Parameter cache
is also a file name string, but it is not a module output. This will create a situation where a properly named file *.cache
is generated to the output directory yet we do not keep track of it: We keep these files just in case we may want to examine them to debug.
One can also use the .
notation for file extensions to enhance readability, for example file(.txt)
is equivalent to file(txt)
.
Load modules from other files
We provide preliminary support to load modules from other files, using %include
, eg.
%include /path/to/other/file1
%include /path/to/other/file2
It will (recursively) load modules from other files. This feature is useful particularly in a collaborative development of a DSC that each developer can write their own module file and use a “master” DSC file to include them all.
Improving DSC script readability
Use comments
As with many scripting languages (R, Python, shell), #
can be used to make human-readable explanation to the contents of DSC script. We suggest using single #
to distinguish between various types of syntax groups introduced in this document, and double ##
to annotate parameters. For example:
# Various modules to estimate location parameter
mean, median: mean.R, median.R
# parameters
## set to 0 for no winsorization
winsorize: 0, 0.02
# input
x: $x
# output
$mean: mean
# decorators
@ALIAS:
median: w = winsorize
Turn on YAML syntax highlighter
Although DSC script is not compatible with YAML, the syntax highlighter for YAML is good enough to enhance readability of DSC scripts. You may turn on the YAML syntax highlighter in your text editor when composing DSC scripts.