DSC Basics, Part I
This is a continuation of the Introduction to DSC. Here we revisit the same example from the first part to explain key concepts in DSC, including modules, groups and pipeline variables. More advanced concepts will be introduced in Part II.
Materials used in this tutorial can be found in the DSC vignettes repository. As before, you may choose to run this example DSC program as you read through the tutorial, but it is not required. For more details, consult the README in the “one sample location” DSC vignette.
Pipeline variables
In DSC, all information is passed from one module to another through pipeline variables. All pipeline variables are indicated with a $ as the first character, as in $data or $mean_test_error.
Under the hood, when a module outputs a pipeline variable, the value of this variable is saved to a file; and when a pipeline variable is provided as input to a module, the value of this variable is read from a file. You will never need to access these files directly; DSC provides a user interface that allows you to access (i.e., query) the pipeline variables without having to know where or how they are stored. In the previous tutorial we illustrated how the pipeline variables can be queried in R.
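To make this file-based handoff concrete, here is a toy sketch in Python (not DSC code; the storage location, file format, and function names are all made up for illustration, and DSC's actual storage layout is internal):

```python
import os
import pickle
import tempfile

# Toy illustration only: DSC's real storage format is internal.
# One module "outputs" a pipeline variable by writing it to a file;
# a downstream module "inputs" it by reading that file back.
store = tempfile.mkdtemp()

def write_pipeline_var(name, value):
    # e.g., a simulate module setting $true_mean
    with open(os.path.join(store, name + ".pkl"), "wb") as f:
        pickle.dump(value, f)

def read_pipeline_var(name):
    # e.g., a score module later reading $true_mean
    with open(os.path.join(store, name + ".pkl"), "rb") as f:
        return pickle.load(f)

write_pipeline_var("true_mean", 0)
print(read_pipeline_var("true_mean"))  # prints 0
```

The point is only that the writer and reader never share memory; all that connects them is the named, stored value, which is what lets DSC query results after the fact.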
Planning a DSC file
The main aim of DSC is to make your benchmark, or computational experiment, easier to read, easier to maintain, and easier to extend. To achieve these aims, we recommend that you start by planning out your DSC project.
We suggest starting the planning by identifying the module types (the key computational steps in your benchmark) and the pipeline variables (the quantities being fed from one step to another). All pipeline variables should be given informative names (which must begin with a $). It is also helpful to give the modules informative names. (Note: unlike other programming languages such as R, names in DSC cannot contain a period.)
Recall, in our working example, that we compared methods for estimating a population mean from a simulated data sample. Although this example was very simple—DSC can handle much more complex settings than this—this example is useful for explaining the basic ideas behind DSC.
This example followed the simulate-analyze-score design pattern, meaning that the benchmark could be naturally decomposed into three module types:
- A simulate module that generates a vector of simulated data ($data), and the population mean setting used to simulate these data ($true_mean).
- An analyze module that accepts $data as input, and outputs an estimate of the population mean ($est_mean).
- A score module that accepts inputs $est_mean and $true_mean, and outputs an error measure ($error).
Therefore, our DSC has four pipeline variables: $data, $true_mean, $est_mean and $error.
For clarity, we have summarized this information in the comments at the top of the DSC file (the # character indicates a comment in a DSC file; anything after a # is ignored by DSC):
# PIPELINE VARIABLES
# $data       simulated data (vector)
# $true_mean  population mean used to simulate $data (scalar)
# $est_mean   population mean estimate (scalar)
# $error      error in the estimate (scalar)
#
# MODULE TYPES
# name      inputs                 outputs
# ----      ------                 -------
# simulate  none                   $data, $true_mean
# analyze   $data                  $est_mean
# score     $est_mean, $true_mean  $error
Two important notes about pipeline variables
- The variables outputted by modules should also include the parts of your benchmark that you want stored for future use. For example, $error is here considered a module output since we would like to use this quantity to study the results of the experiment.
- The pipeline variables are the only way that modules can communicate with one another. So if a module requires access to a piece of information generated by a previous module, then this information must be a pipeline variable. For example, a score module requires access to the true mean used to generate the data, so here this quantity is assigned to a pipeline variable $true_mean from a simulate module.
In summary, all information in a DSC program is local to each module unless it is defined as a module output and assigned to a pipeline variable using the $ character.
Defining modules in a DSC file
At its simplest, a DSC module consists of a name, a script (code) implementing the module, and details of how quantities are to be passed in and out of each script.
Here we illustrate the syntax by explaining each module in our example. To keep the presentation focused on the key concepts, we use a slightly simplified version of the mean estimation example that achieves the same result; see file first_investigation_simpler.dsc in the DSC vignettes repository for the full example.
The normal module
The first module defined in the DSC file is the normal module:
normal: R(x <- rnorm(n = 100, mean = 0, sd = 1))
  $data: x
  $true_mean: 0
(Again, this code is simplified slightly from the previous tutorial, but it achieves the same result.)
This code tells DSC three things:
- The name of the module is "normal".
- The R script implementing the module is a single line of code: x <- rnorm(n = 100, mean = 0, sd = 1). Here, R() tells DSC that this code should be parsed and evaluated as an R script. For longer code, this should be replaced with the name of a file containing the R code. Any global variables (that is, variables that are not local to a function) defined inside a script are called "script variables"; in this module, one script variable is defined, x.
- After running the script, the pipeline variable $data is set to the value of script variable x. A second pipeline variable, $true_mean, is set to 0.
The t module
The t module has the same outputs as the normal module, but generates $data from a t distribution with a mean of 3 and 2 degrees of freedom. After running the script, the pipeline variable $data is assigned the value of script variable x, and $true_mean is set to 3.
Here is the code:
t: R(x <- 3 + rt(n = 100, df = 2))
  $data: x
  $true_mean: 3
The two analyze modules
Our example has two analyze modules: the mean module estimates the population mean by the sample mean, and the median module estimates the population mean by the sample median.
They are defined in the DSC file as follows:
mean: R(y <- mean(x))
  x: $data
  $est_mean: y

median: R(y <- median(x))
  x: $data
  $est_mean: y
These modules differ from the simulate modules in that they have inputs in addition to outputs:
- The line x: $data specifies a module input. It tells DSC that, before running the R code, it should define a new global variable x in the R environment, and set the value of x to the value of the pipeline variable $data. The value of $data is given by the most recently run module in the pipeline that assigned a value to $data.
- The line $est_mean: y specifies a module output. It tells DSC that after running the script it should set the value of the pipeline variable $est_mean to the value of the script variable y.
Important note: Although the R code in the normal, t, median and mean modules all define a script variable x, these variables are distinct (i.e., they are local to each module), and we must use a pipeline variable (here we use $data) to pass the information in x from one script to another.
In summary, all script variables are local to each module, and information can flow from one module to another only through pipeline variables.
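This locality can be pictured with a Python analogy (not DSC code, and the numbers here are made up): each module behaves like a function whose variables are local, and only the value passed between functions, playing the role of the pipeline variable, crosses the boundary.

```python
# Python analogy (not DSC code): each module's script variables are
# local, like a function's locals; only the value passed between
# functions (the "pipeline variable") crosses module boundaries.
def normal_module():
    x = [0.1, -0.3, 0.05]  # script variable x, local to this "module"
    return x               # exposed as the pipeline variable $data

def mean_module(x):        # this x is a different, local variable
    y = sum(x) / len(x)    # script variable y, local to this "module"
    return y               # exposed as the pipeline variable $est_mean

data = normal_module()
est_mean = mean_module(data)
print(round(est_mean, 3))  # prints -0.05
```

Renaming x inside one function has no effect on the other; the only shared interface is the returned value, just as DSC modules share only their pipeline variables.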
The two score modules
Finally, we create two score modules that measure error in the estimate, one based on squared differences (sq_err) and another based on absolute differences (abs_err):
sq_err: R(e <- (x - y)^2)
  x: $est_mean
  y: $true_mean
  $error: e

abs_err: R(e <- abs(x - y))
  x: $est_mean
  y: $true_mean
  $error: e
The inputs to both modules are $est_mean and $true_mean, and the output is $error.
In each of these modules, there are three script variables:
- x is set to the current value of pipeline variable $est_mean.
- y is set to the current value of pipeline variable $true_mean.
- e is determined by the R code, and its value is assigned to pipeline variable $error.
This completes the module definitions.
Defining groups and pipelines
So far, we have defined the modules (that is, the individual computational steps of the DSC experiment), but we have not yet described how these modules relate to each other, and how they are used to build pipelines. This is the purpose of the DSC block in the DSC file; the key components of the DSC block are the define and run statements.
DSC:
  define:
    simulate: normal, t
    analyze: mean, median
    score: abs_err, sq_err
  run: simulate * analyze * score
This code does the following:
- DSC is a special keyword indicating that we are defining the module groups and pipelines (that is, we are not defining a module).
- define is another keyword indicating that we are defining module groups.
- We define three module groups: simulate, analyze and score. The distinguishing characteristic of the modules in a group is that they should have a similar function (and, in the simplest case, the same inputs and outputs, although this is not required).
- run is a keyword indicating that we are defining the computational pipelines (sequences of modules) to be executed.
- The A * B notation asks DSC to generate all possible sequences of modules from groups A and B; that is, all sequences of the form (a, b) where a is a module in group A and b is a module in group B. So, in this example, simulate * analyze * score generates all pipelines that consist of a module from the simulate group (normal or t), followed by a module from the analyze group (mean or median), and then a module from the score group (sq_err or abs_err). In this example, there are $2 \times 2 \times 2 = 8$ different pipelines defined by this run statement.
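To see where this count comes from, the * notation behaves like a Cartesian product over the groups. Here is a small sketch in Python (not DSC code) that enumerates the same combinations:

```python
from itertools import product

# The three module groups from the define statement.
simulate = ["normal", "t"]
analyze = ["mean", "median"]
score = ["abs_err", "sq_err"]

# run: simulate * analyze * score enumerates every combination,
# taking one module from each group, in order.
pipelines = list(product(simulate, analyze, score))

print(len(pipelines))  # prints 8
print(pipelines[0])    # prints ('normal', 'mean', 'abs_err')
```

Each tuple corresponds to one row of the results table we query later.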
Executing the DSC benchmark
Up to this point, all we have is a bunch of code stored in a text file; the DSC file doesn't do anything unless it is interpreted and executed by the dsc program.
The dsc program will run all 8 pipelines, keeping track of the values of all the script variables and pipeline variables generated in each pipeline, and it will store the values of all the module outputs, which we will retrieve for analysis in R.
To run the DSC benchmark, change the working directory to the location of the first_investigation_simpler.dsc file. (Here we assume the dsc repository is stored in the git subdirectory of your home directory. If you are running the example yourself, please move to the appropriate directory on your computer.)
cd ~/git/dsc/vignettes/one_sample_location
Let’s also remove any previously generated results.
rm -Rf first_investigation.html first_investigation.log first_investigation
To keep this example as simple as possible, we generate only a single replicate for each of the pipelines, and run at most 2 modules in parallel at any one time:
dsc first_investigation_simpler.dsc --replicate 1 -c 2
INFO: DSC script exported to first_investigation.html
INFO: Constructing DSC from first_investigation_simpler.dsc ...
INFO: Building execution graph & running DSC ...
[#############################] 29 steps processed (29 jobs completed)
INFO: Building DSC database ...
INFO: DSC complete!
INFO: Elapsed time 6.206 seconds.
Inspecting the results of the DSC benchmark
To inspect the outcomes generated by each of the 8 pipelines, change the R working directory to the location of the DSC file, and use the dscquery function from the dscrutils package to load the DSC results into R:
setwd("~/git/dsc/vignettes/one_sample_location")
library(dscrutils)
dscout <-
dscquery(dsc.outdir = "first_investigation",
targets = c("simulate.true_mean","analyze.est_mean","score.error"))
dscout
The results in this table illustrate a few essential features of a DSC program:
- Looking at the "simulate", "analyze" and "score" table columns, we can confirm that DSC has run 8 different pipelines. Each pipeline runs a different combination of the simulate, analyze and score modules.
- In DSC, each module input and output is assigned a different value within each pipeline. This is very different from most programs, in which each variable is assigned a single value. For example, score.error has 8 different values, one for each of the 8 different pipelines. (We did not include the simulate.data module output in this table because it is too large to show here, but its value can be extracted like the other outputs.)
- Information flows between modules within the same pipeline. In the first pipeline (the first row of the table), for example, the error in the abs_err module (0.05858) is calculated from (1) the value of the true mean, which was set to 0 in the normal module, and (2) the estimated mean, which was set to 0.05858 in the mean module.
Pipeline evaluation, and alternatives
In the previous section, we observed that DSC generated 8 pipelines, in which each pipeline is a different combination of modules and contains a different set of results. Now we explain at a high level how DSC produced these results.
- DSC runs the two simulate modules, normal and t, then stores the values assigned to $true_mean and $data. Therefore, the module outputs $true_mean and $data each have two alternative values: the value assigned by running the normal module, and the value assigned by running the t module. (For example, the alternative values of $true_mean are 0 and 3.) normal and t are alternative modules in the simulate group. Therefore, although the normal module appears in 4 pipelines, and the t module also appears in 4 pipelines, each simulate module only needs to be run once.
- DSC runs the two analyze modules, mean and median, then stores the values assigned to the module output, $est_mean. The mean module is run twice, once for each alternative value of $data (the value of $data is assigned to module input x in the mean and median modules). Likewise, the median module is run twice, once for each alternative value of $data. Therefore, the analyze step is evaluated 4 times in total, and the module output $est_mean has 4 alternative values. If you look closely at the dscout data frame above, you will see that analyze.est_mean contains 4 unique values.
- DSC runs the two score modules, sq_err and abs_err. These modules both accept two pipeline variables as input, $est_mean and $true_mean. Since there are 4 alternative joint assignments of $est_mean and $true_mean (one for each analyze evaluation), each score module is evaluated 4 times, so in the end DSC stores 8 different values for the final module output, $error.
A naive approach would have been to run the simulate step 8 times, the analyze step 8 times, and the score step 8 times, but that would have been a waste of time, since many of the computations would be redundant. DSC performs the minimum amount of computation needed to generate the results for all the pipelines by generalizing the steps described here. Although wasted computation has little noticeable effect on a small experiment such as this, it could be very important when large data sets are being simulated and analyzed.
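As a back-of-envelope tally of the savings, based on the evaluation counts described above (this counts module evaluations only, not the internal job steps reported by the dsc program):

```python
# Module evaluations under DSC's scheme for this example:
n_simulate = 2     # normal and t each run once
n_analyze = 2 * 2  # mean and median each run once per $data value
n_score = 2 * 4    # sq_err and abs_err each run once per
                   # ($est_mean, $true_mean) combination
dsc_total = n_simulate + n_analyze + n_score

# Naive scheme: rerun simulate, analyze, and score in all 8 pipelines.
naive_total = 8 * 3

print(dsc_total, naive_total)  # prints 14 24
```

Even in this tiny benchmark DSC evaluates 14 modules instead of 24, and the gap grows quickly as groups gain more modules.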
Note that, under the hood, the order of evaluation may not be exactly as we described it here (for example, the mean and median modules might be evaluated on the output from the normal module before the output from the t module is available), but the exact order of evaluation is unimportant for understanding how the DSC results are generated.
Recap
In this tutorial, we learned how the DSC file is used to define the key components of a DSC experiment:
- Module inputs and outputs (pipeline variables);
- Module scripts and script variables;
- Module groups; and
- Module sequences (pipelines).
We also learned:
- How information flows between modules executed in a pipeline;
- How values are assigned to pipeline variables separately in each pipeline; and
- How modules are evaluated, and how alternative values of module outputs are stored.
Next steps
In Part II, we will extend this example and introduce a few other useful DSC features.