DSC Basics, Part I
This is a continuation of the Introduction to DSC. Here we revisit the same example from the first part to explain key concepts in DSC, including modules, groups and pipeline variables. More advanced concepts will be introduced in Part II.
Materials used in this tutorial can be found in the DSC vignettes repository. As before, you may choose to run this example DSC program as you read through the tutorial, but it is not required. For more details, consult the README in the “one sample location” DSC vignette.
Pipeline variables
In DSC, all information is passed from one module to another through pipeline variables. All pipeline variables are indicated with a $ as the first character, as in $data or $mean_test_error.
Under the hood, when a module outputs a pipeline variable, the value of this variable is saved to a file; and when a pipeline variable is provided as input to a module, the value of this variable is read from a file. You will never need to access these files directly; DSC provides a user interface that allows you to access (i.e., query) the pipeline variables without having to know where or how they are stored. In the previous tutorial we illustrated how the pipeline variables can be queried in R.
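The file-based hand-off described above can be mimicked in plain R. The following is only an analogy: it assumes nothing about the storage format or file layout that DSC actually uses, and the temporary file and variable names are hypothetical.

```r
# Analogy only: DSC manages its own files, and its real storage
# format is not shown here. The file name is a hypothetical stand-in.
value_file <- tempfile(fileext = ".rds")

# A first "module" produces a value and writes it to a file.
data_out <- rnorm(100)
saveRDS(data_out, value_file)

# A later "module" receives the value by reading the file back.
data_in <- readRDS(value_file)

identical(data_in, data_out)  # TRUE
```

In DSC itself, this bookkeeping is done for you, which is why the files never need to be opened by hand.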
Planning a DSC file
The main aim of DSC is to make your benchmark, or computational experiment, easier to read, easier to maintain, and easier to extend. To achieve these aims, we recommend that you start by planning out your DSC project.
We suggest starting the planning by identifying the module types (the key computational steps in your benchmark) and the pipeline variables (the quantities being fed from one step to another). All pipeline variables should be given informative names (which must begin with a $). It is also helpful to give the modules informative names. (Note: unlike other programming languages such as R, names in DSC cannot have a period.)
Recall, in our working example, that we compared methods for estimating a population mean from a simulated data sample. Although this example was very simple—DSC can handle much more complex settings than this—this example is useful for explaining the basic ideas behind DSC.
This example followed the simulate-analyze-score design pattern, meaning that the benchmark could be naturally decomposed into three module types:
- A simulate module that generates a vector of simulated data ($data), and the population mean setting used to simulate these data ($true_mean).
- An analyze module that accepts $data as input, and outputs an estimate of the population mean ($est_mean).
- A score module that accepts inputs $est_mean and $true_mean, and outputs an error measure ($error).
Therefore, our DSC has four pipeline variables: $data, $true_mean, $est_mean and $error.
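As a rough mental model, the three module types can be pictured as three R functions chained together, with the list elements standing in for the pipeline variables. This is a sketch in plain R, not DSC syntax; the function names simply mirror the module types, and the sample mean with squared error is just one illustrative combination.

```r
# Plain-R sketch of the simulate-analyze-score pattern (not DSC syntax).
# List elements play the role of pipeline variables.
simulate <- function(n = 100) {
  list(data = rnorm(n, mean = 0, sd = 1), true_mean = 0)
}
analyze <- function(data) {
  list(est_mean = mean(data))
}
score <- function(est_mean, true_mean) {
  list(error = (est_mean - true_mean)^2)
}

# Each step only sees what the previous steps explicitly handed over.
sim <- simulate()
est <- analyze(sim$data)
res <- score(est$est_mean, sim$true_mean)
res$error
```

Note that the score step needs the true mean, so the simulate step has to return it explicitly; the same reasoning is why $true_mean is a pipeline variable.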
For clarity, we have summarized this information in the comments at the top of the DSC file (the # character indicates a comment in a DSC file—anything after a # is ignored by DSC):
# PIPELINE VARIABLES
# $data       simulated data (vector)
# $true_mean  population mean used to simulate $data (scalar)
# $est_mean   population mean estimate (scalar)
# $error      error in the estimate (scalar)
# MODULE TYPES
# name     inputs                outputs
# ----     ------                -------
# simulate none                  $data, $true_mean
# analyze  $data                 $est_mean
# score    $est_mean, $true_mean $error
Two important notes about pipeline variables
- The variables output by modules should also include the parts of your benchmark that you want stored for future use. For example, $error is here considered a module output since we would like to use this quantity to study the results of the experiment.
- The pipeline variables are the only way that modules can communicate with one another. So if a module requires access to a piece of information generated by a previous module, then this must be a pipeline variable. For example, a score module requires access to the true mean used to generate the data, so here this quantity is assigned to a pipeline variable, $true_mean, by a simulate module.
In summary, all information in a DSC program is local to each module unless it is defined as a module output and assigned to a pipeline variable using the $ character.
Defining modules in a DSC file
At its simplest, a DSC module consists of a name, a script (code) implementing the module, and details of how quantities are to be passed in and out of each script.
Here we illustrate the syntax by explaining each module in our example. To keep the presentation focused on the key concepts, we use a slightly simplified version of the mean estimation example that achieves the same result; see file first_investigation_simpler.dsc in the DSC vignettes repository for the full example.
The normal module
The first module defined in the DSC file is the normal module:
normal: R(x <- rnorm(n = 100,mean = 0,sd = 1))
  $data: x
  $true_mean: 0
(Again, this code is simplified slightly from the previous tutorial, but it achieves the same result.)
This code tells DSC three things:
- The name of the module is “normal”.
- The R script implementing the module is a single line of code: x <- rnorm(n = 100,mean = 0,sd = 1). Here, R() tells DSC that this code should be parsed and evaluated as an R script. For longer code, this should be replaced with the name of a file containing the R code. Any global variables (that is, variables that are not local to a function) defined inside a script are called “script variables”; in this module, one script variable is defined, x.
- After running the script, the pipeline variable $data is set to the value of script variable x. A second pipeline variable, $true_mean, is set to 0.
The t module
The t module has the same outputs as the normal module, but generates $data from a t distribution with 2 degrees of freedom, shifted so that its mean is 3. After running the script, the pipeline variable $data is assigned the value of script variable x, and $true_mean is set to 3.
Here is the code:
t: R(x <- 3 + rt(n = 100,df = 2))
  $data: x
  $true_mean: 3
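The standard t distribution is centered at zero, so adding 3 moves its center to 3, which is why $true_mean is set to 3 in this module. A quick check in R, using a much larger sample than the module does (the sample size and the seed are arbitrary choices made for this illustration):

```r
# Sanity check: a t-distributed sample shifted by 3 is centered at 3.
# The sample size (1e5) and seed are arbitrary illustrative choices.
set.seed(1)
x <- 3 + rt(n = 1e5, df = 2)
median(x)  # close to 3
```

With only 2 degrees of freedom the distribution has very heavy tails, so the sample mean is far more variable than the sample median. This is part of what makes the comparison between the mean and median modules informative.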
The two analyze modules
Our example has two analyze modules: the mean module estimates the population mean by the sample mean, and the median module estimates the population mean by the sample median.
They are defined in the DSC file as follows:
mean: R(y <- mean(x))
  x: $data
  $est_mean: y
median: R(y <- median(x))
  x: $data
  $est_mean: y
These modules differ from the simulate modules in that they have inputs in addition to outputs:
- The line x: $data specifies a module input. It tells DSC that, before running the R code, it should define a new global variable x in the R environment, and set the value of x to the value of the pipeline variable $data. The value of $data is given by the most recently run module in the pipeline that assigned a value to $data.
- The line $est_mean: y specifies a module output. It tells DSC that after running the script it should set the value of the pipeline variable $est_mean to the value of the script variable y.
Important note: Although the R code in the normal, t, median and mean modules all define a script variable x, these variables are distinct (i.e., they are local to each module), and we must use a pipeline variable (here we use $data) to pass the information on x from one script to another.
In summary, all script variables are local to each module, and information can flow from one module to another only through pipeline variables.
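One way to picture this locality is to evaluate each script in its own fresh R environment, copying in only the declared inputs. The helper run_module below is hypothetical and is not how DSC is implemented; it only mimics the behavior described above.

```r
# Hypothetical helper (not part of DSC): run a script in a fresh
# environment that contains only the declared inputs.
run_module <- function(code, inputs = list()) {
  env <- new.env(parent = globalenv())
  for (name in names(inputs)) {
    assign(name, inputs[[name]], envir = env)
  }
  eval(parse(text = code), envir = env)
  env
}

# "normal" module: no inputs; script variable x becomes $data.
sim_env <- run_module("x <- rnorm(n = 100, mean = 0, sd = 1)")
data <- get("x", envir = sim_env)

# "mean" module: input x is set from $data; script variable y
# becomes $est_mean.
ana_env <- run_module("y <- mean(x)", inputs = list(x = data))
est_mean <- get("y", envir = ana_env)

# The script variable y exists only inside the second module.
exists("y", envir = sim_env, inherits = FALSE)  # FALSE
```

The two environments share nothing by default: the value of x reaches the second script only because it was passed explicitly, just as $data is passed between modules.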
The two score modules
Finally, we create two score modules that measure error in the estimate, one based on squared differences (sq_err) and another based on absolute differences (abs_err):
sq_err: R(e <- (x - y)^2)
  x: $est_mean
  y: $true_mean
  $error: e
abs_err: R(e <- abs(x - y))
  x: $est_mean
  y: $true_mean
  $error: e
The inputs to both modules are $est_mean and $true_mean, and the output is $error.
In each of these modules, there are three script variables:
- x is set to the current value of pipeline variable $est_mean.
- y is set to the current value of pipeline variable $true_mean.
- e is determined by the R code, and its value is assigned to pipeline variable $error.
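To make the two error measures concrete, here is the arithmetic for one made-up pair of inputs. The values 0.5 and 0 are hypothetical and are not taken from the DSC results.

```r
# Hypothetical inputs, chosen only to illustrate the two formulas.
x <- 0.5  # stands in for $est_mean
y <- 0    # stands in for $true_mean

(x - y)^2   # sq_err:  0.25
abs(x - y)  # abs_err: 0.5
```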
This completes the module definitions.
Defining groups and pipelines
So far, we have defined the modules—that is, the individual computational steps of the DSC experiment—but we have not yet described how these modules relate to each other, and how they are used to build pipelines. This is the purpose of the DSC block in the DSC file; the key components of the DSC block are the define and run statements.
DSC:
  define:
    simulate: normal, t
    analyze: mean, median
    score: abs_err, sq_err
  run: simulate * analyze * score
This code does the following:
- DSC is a special keyword indicating that we are defining the module groups and pipelines (that is, we are not defining a module).
- define is another keyword indicating that we are defining module groups.
- We define three module groups: simulate, analyze and score. The distinguishing characteristic of module groups is that they should have a similar function (and, in the simplest case, the same inputs and outputs, although this is not required).
- run is a keyword indicating that we are defining the computational pipelines (sequences of modules) to be executed.
- The A * B notation asks DSC to generate all possible sequences of modules from groups A and B; that is, all sequences of the form (a, b) where a is a module in group A and b is a module in group B. So, in this example, simulate * analyze * score generates all pipelines that consist of a module from the simulate group (normal or t), followed by a module from the analyze group (mean or median), and then a module from the score group (sq_err or abs_err). In this example, there are $2 \times 2 \times 2 = 8$ different pipelines defined by this run statement.
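The combinations generated by the run statement can be enumerated in plain R with expand.grid. This only lists the module sequences; it does not run anything, and it is not how DSC builds its pipelines internally.

```r
# Enumerate every simulate * analyze * score combination.
pipelines <- expand.grid(simulate = c("normal", "t"),
                         analyze  = c("mean", "median"),
                         score    = c("abs_err", "sq_err"),
                         stringsAsFactors = FALSE)
pipelines
nrow(pipelines)  # 8
```

Adding a third module to any one group would raise the count to 12, which shows how quickly the number of pipelines grows as groups are extended.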
Executing the DSC benchmark
Up to this point, all we have is a bunch of code stored in a text file; the DSC file doesn’t do anything unless it is interpreted and executed by the dsc program.
The dsc program will run all 8 pipelines, keeping track of the values of all the script variables and pipeline variables generated in each pipeline, and it will store the values of all the module outputs, which we will retrieve for analysis in R.
To run the DSC benchmark, change the working directory to the location of the first_investigation_simpler.dsc file. (Here we assume the dsc repository is stored in the git subdirectory of your home directory. If you are running the example yourself, please move to the appropriate directory on your computer.)
cd ~/git/dsc/vignettes/one_sample_location
Let’s also remove any previously generated results.
rm -Rf first_investigation.html first_investigation.log first_investigation
To keep this example as simple as possible, generate only a single replicate for each of the pipelines, and execute in parallel at most 2 modules at any one time:
dsc first_investigation_simpler.dsc --replicate 1 -c 2
INFO: DSC script exported to first_investigation.html
INFO: Constructing DSC from first_investigation_simpler.dsc ...
INFO: Building execution graph & running DSC ...
[#############################] 29 steps processed (29 jobs completed)
INFO: Building DSC database ...
INFO: DSC complete!
INFO: Elapsed time 6.206 seconds.
Inspecting the results of the DSC benchmark
To inspect the outcomes generated by each of the 8 pipelines, change the R working directory to the location of the DSC file, and use the dscquery function from the dscrutils package to load the DSC results into R:
setwd("~/git/dsc/vignettes/one_sample_location")
library(dscrutils)
dscout <-
  dscquery(dsc.outdir = "first_investigation",
           targets = c("simulate.true_mean","analyze.est_mean","score.error"))
dscout
The results in this table illustrate a few essential features of a DSC program:
- Looking at the “simulate”, “analyze” and “score” table columns, we can confirm that DSC has run 8 different pipelines. Each pipeline runs a different combination of the simulate, analyze and score modules.
- In DSC, each module input and output is assigned a different value within each pipeline. This is very different from most programs where each variable is assigned a single value. For example, score.error has 8 different values, one for each of the 8 pipelines. (We did not include the simulate.data module output in this table because it is too large to show here, but its value can be extracted like the other outputs.)
- Information flows between modules within the same pipeline. In the first pipeline (the first row of the table), for example, the error in the abs_err module (0.05858) is calculated from (1) the value of the true mean, which was set to 0 in the normal module, and (2) the estimated mean, which was set to 0.05858 in the mean module.
Pipeline evaluation, and alternatives
In the previous section, we observed that DSC generated 8 pipelines, in which each pipeline is a different combination of modules and contains a different set of results. Now we explain at a high level how DSC produced these results.
- DSC runs the two simulate modules, normal and t, then stores the values assigned to $true_mean and $data. Therefore, the module outputs $true_mean and $data have two alternative values: the value assigned by running the normal module, and the value assigned by running the t module. (For example, the alternative values of $true_mean are 0 and 3.) normal and t are alternative modules in the simulate group. Therefore, although the normal module appears in 4 pipelines, and the t module also appears in 4 pipelines, each simulate module only needs to be run once.
- DSC runs the two analyze modules, mean and median, then stores the values assigned to the module output, $est_mean. The mean module is run twice, once for each alternative value of $data (the value of $data is assigned to module input x in the mean and median modules). Likewise, the median module is run twice, once for each alternative value of $data. Therefore, the analyze step is evaluated 4 times in total, and the module output $est_mean has 4 alternative values. If you look closely at the dscout data frame above, you will see that the analyze.est_mean column contains 4 unique values.
- DSC runs the two score modules, sq_err and abs_err. These modules both accept two pipeline variables as input, $est_mean and $true_mean. Since there are 4 alternative combinations of $est_mean and $true_mean (one for each pairing of a simulate module with an analyze module), each score module is evaluated 4 times, so in the end DSC stores 8 different values for the final module output, $error.
A naive approach would have been to run the simulate step 8 times, the analyze step 8 times, and the score step 8 times, but that would have been a waste of time, since many of the computations would be redundant. DSC performs the minimum amount of computation needed to generate the results for all the pipelines by generalizing the steps described here. Although wasted computation will have little noticeable effect on a small experiment such as this, this could be very important when large data sets are being simulated and analyzed.
Note that, under the hood, the order of evaluation may not be exactly as we described it here. For example, the mean and median modules might be evaluated with the output from the normal module before the output from the t module is available. The exact order of evaluation, however, is unimportant for understanding how the DSC results are generated.
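The reuse of intermediate results can be sketched with nested loops in plain R, counting how often each step runs. This is an illustration of the counting argument only: the scheduling DSC actually performs is not described here, and the variable and column names below are hypothetical and do not match the dscquery output exactly.

```r
# Plain-R sketch: each result is computed once and reused downstream.
# Names are illustrative; this is not DSC's actual scheduler.
set.seed(1)
simulate_mods <- list(
  normal = function() list(data = rnorm(100), true_mean = 0),
  t      = function() list(data = 3 + rt(100, df = 2), true_mean = 3)
)
analyze_mods <- list(mean = mean, median = median)
score_mods <- list(
  abs_err = function(x, y) abs(x - y),
  sq_err  = function(x, y) (x - y)^2
)

calls <- c(simulate = 0, analyze = 0, score = 0)
results <- list()
for (s in names(simulate_mods)) {
  sim <- simulate_mods[[s]]()            # run once per simulate module
  calls["simulate"] <- calls["simulate"] + 1
  for (a in names(analyze_mods)) {
    est <- analyze_mods[[a]](sim$data)   # reuses the stored data
    calls["analyze"] <- calls["analyze"] + 1
    for (sc in names(score_mods)) {
      err <- score_mods[[sc]](est, sim$true_mean)
      calls["score"] <- calls["score"] + 1
      results[[length(results) + 1]] <- data.frame(
        simulate = s, analyze = a, score = sc,
        true_mean = sim$true_mean, est_mean = est, error = err
      )
    }
  }
}
out <- do.call(rbind, results)

calls      # simulate 2, analyze 4, score 8
nrow(out)  # 8 pipelines
```

The counts 2, 4 and 8 match the description above, compared with 8 runs of every step under the naive approach.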
Recap
In this tutorial, we learned how the DSC file is used to define the key components of a DSC experiment:
- Module inputs and outputs (pipeline variables);
- Module scripts and script variables;
- Module groups; and
- Module sequences (pipelines).
We also learned:
- How information flows between modules executed in a pipeline;
- How values are assigned to pipeline variables separately in each pipeline;
- How modules are evaluated, and how alternative values of module outputs are stored.
Next steps
In Part II, we will extend this example and introduce a few other useful DSC features.