DSC Basics, Part I

This is a continuation of the Introduction to DSC. Here we revisit the same example from the first part to explain key concepts in DSC, including modules, groups and pipeline variables. More advanced concepts will be introduced in Part II.

Materials used in this tutorial can be found in the DSC vignettes repository. As before, you may choose to run this example DSC program as you read through the tutorial, but it is not required. For more details, consult the README in the “one sample location” DSC vignette.

Pipeline variables

In DSC, all information is passed from one module to another through pipeline variables. All pipeline variables are indicated with a $ as the first character, as in $data or $mean_test_error.

Under the hood, when a module outputs a pipeline variable, the value of this variable is saved to a file; and when a pipeline variable is provided as input to a module, the value of this variable is read from a file. You will never need to access these files directly; DSC provides a user interface that allows you to access (i.e., query) the pipeline variables without having to know where or how they are stored. In the previous tutorial we illustrated how the pipeline variables can be queried in R.

Planning a DSC file

The main aim of DSC is to make your benchmark, or computational experiment, easier to read, easier to maintain, and easier to extend. To achieve these aims, we recommend that you start by planning out your DSC project.

We suggest starting the planning by identifying the module types (the key computational steps in your benchmark) and the pipeline variables (the quantities being fed from one step to another). All pipeline variables should be given informative names (which must begin with a $). It is also helpful to give the modules informative names. (Note: unlike other programming languages such as R, names in DSC cannot have a period.)

Recall, in our working example, that we compared methods for estimating a population mean from a simulated data sample. Although this example is very simple (DSC can handle much more complex settings), it is useful for explaining the basic ideas behind DSC.

This example followed the simulate-analyze-score design pattern, meaning that the benchmark could be naturally decomposed into three module types:

  • A simulate module that generates a vector of simulated data ($data), and the population mean setting used to simulate these data ($true_mean).

  • An analyze module that accepts $data as input, and outputs an estimate of the population mean ($est_mean).

  • A score module that accepts inputs $est_mean and $true_mean, and outputs an error measure ($error).

Therefore, our DSC has four pipeline variables: $data, $true_mean, $est_mean and $error.

For clarity, we have summarized this information in the comments at the top of the DSC file (the # character indicates a comment in a DSC file—anything after a # is ignored by DSC):

# PIPELINE VARIABLES
# $data       simulated data (vector)
# $true_mean  population mean used to simulate $data (scalar)
# $est_mean   population mean estimate (scalar)
# $error      error in the estimate (scalar)

# MODULE TYPES
# name     inputs                outputs
# ----     ------                -------
# simulate none                  $data, $true_mean
# analyze  $data                 $est_mean
# score    $est_mean, $true_mean $error

Two important notes about pipeline variables

  1. The variables output by modules should also include any parts of your benchmark that you want stored for future use. For example, $error is considered a module output here because we would like to use this quantity to study the results of the experiment.

  2. The pipeline variables are the only way that modules can communicate with one another. So if a module requires access to a piece of information generated by a previous module, then this must be a pipeline variable. For example, a score module requires access to the true mean used to generate the data, so here this quantity is assigned to a pipeline variable $true_mean from a simulate module.

In summary, all information in a DSC program is local to each module unless it is defined as a module output and assigned to a pipeline variable using the $ character.

Defining modules in a DSC file

At its simplest, a DSC module consists of a name, a script (code) implementing the module, and details of how quantities are to be passed in and out of each script.

Here we illustrate the syntax by explaining each module in our example. To keep the presentation focused on the key concepts, we use a slightly simplified version of the mean estimation example that achieves the same result; see file first_investigation_simpler.dsc in the DSC vignettes repository for the full example.

The normal module

The first module defined in the DSC file is the normal module:

normal: R(x <- rnorm(n = 100,mean = 0,sd = 1))
  $data: x
  $true_mean: 0

(Again, this code is simplified slightly from the previous tutorial, but it achieves the same result.)

This code tells DSC three things:

  1. The name of the module is “normal”.

  2. The R script implementing the module is a single line of code: x <- rnorm(n = 100,mean = 0,sd = 1). Here, R() tells DSC that this code should be parsed and evaluated as an R script. For longer code, the inline code should be replaced with the name of a file containing the R code. Any global variables (that is, variables that are not local to a function) defined inside a script are called “script variables”; in this module, one script variable is defined, x.

  3. After running the script, the pipeline variable $data is set to the value of script variable x. A second pipeline variable, $true_mean, is set to 0.

The t module

The t module has the same outputs as the normal module, but generates $data from a t distribution with 2 degrees of freedom, shifted to have a mean of 3. After running the script, the pipeline variable $data is assigned the value of script variable x, and $true_mean is set to 3.

Here is the code:

t: R(x <- 3 + rt(n = 100,df = 2))
  $data: x
  $true_mean: 3
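
As an aside, the t distribution with 2 degrees of freedom is heavy-tailed (its variance is infinite), which makes its mean harder to estimate; this will matter later when we compare the mean and median estimates. Here is a quick standalone check you can run in an ordinary R session (this snippet is our own illustration and is not part of the DSC file):

# Simulate from the same shifted t distribution used by the "t" module,
# then compare the sample mean and median as estimates of the true
# mean (3). With such heavy tails, the mean is often the worse of the two.
x <- 3 + rt(n = 100, df = 2)
mean(x)
median(x)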

The two analyze modules

Our example has two analyze modules: the mean module estimates the population mean by the sample mean, and the median module estimates the population mean by the sample median.

They are defined in the DSC file as follows:

mean: R(y <- mean(x))
  x: $data
  $est_mean: y

median: R(y <- median(x))
  x: $data
  $est_mean: y

These modules differ from the simulate modules in that they have inputs in addition to outputs:

  • The line x: $data specifies a module input. It tells DSC that, before running the R code, it should define a new global variable x in the R environment, and set the value of x to the value of the pipeline variable $data. The value of $data is given by the most recently run module in the pipeline that assigned a value to $data.

  • The line $est_mean: y specifies a module output. It tells DSC that after running the script it should set the value of the pipeline variable $est_mean to the value of the script variable y.

Important note: Although the R code in the normal, t, median and mean modules all define a script variable x, these variables are distinct (i.e., they are local to each module), and we must use a pipeline variable (here we use $data) to pass the information on x from one script to another.

In summary, all script variables are local to each module, and information can flow from one module to another only through pipeline variables.

The two score modules

Finally, we create two score modules that measure error in the estimate, one based on squared differences (sq_err) and another based on absolute differences (abs_err):

sq_err: R(e <- (x - y)^2)
  x: $est_mean
  y: $true_mean
  $error: e

abs_err: R(e <- abs(x - y))
  x: $est_mean
  y: $true_mean
  $error: e

The inputs to both modules are $est_mean and $true_mean, and the output is $error.

In each of these modules, there are three script variables:

  1. x is set to the current value of pipeline variable $est_mean.

  2. y is set to the current value of pipeline variable $true_mean.

  3. e is determined by the R code, and its value is assigned to pipeline variable $error.

This completes the module definitions.
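
Before moving on, it may help to make the flow of information concrete. The following sketch, written in plain R, traces what happens conceptually when one pipeline (normal, then mean, then sq_err) is run. This is only an illustration of the logic (it is not how DSC actually executes modules), and the variables data, true_mean, est_mean and error are ordinary R variables standing in for the pipeline variables $data, $true_mean, $est_mean and $error:

## normal module
x <- rnorm(n = 100, mean = 0, sd = 1)   # module script
data      <- x                          # output $data: x
true_mean <- 0                          # output $true_mean: 0

## mean module
x <- data                               # input x: $data
y <- mean(x)                            # module script
est_mean  <- y                          # output $est_mean: y

## sq_err module
x <- est_mean                           # input x: $est_mean
y <- true_mean                          # input y: $true_mean
e <- (x - y)^2                          # module script
error     <- e                          # output $error: e

Notice that x and y are reused across steps in this sketch; in an actual DSC run there is no such conflict, because each module's script variables live in their own separate environment.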

Defining groups and pipelines

So far, we have defined the modules—that is, the individual computational steps of the DSC experiment—but we have not yet described how these modules relate to each other, and how they are used to build pipelines. This is the purpose of the DSC block in the DSC file; the key components of the DSC block are the define and run statements. Here is the DSC block from our example:

DSC:
  define:
    simulate: normal, t
    analyze: mean, median
    score: abs_err, sq_err
  run: simulate * analyze * score

This code does the following:

  1. DSC is a special keyword indicating that we are defining the module groups and pipelines (that is, we are not defining a module).

  2. define is another keyword indicating that we are defining module groups.

  3. We define three module groups: simulate, analyze and score. The distinguishing characteristic of a module group is that its member modules serve a similar function (and, in the simplest case, have the same inputs and outputs, although this is not required).

  4. run is a keyword indicating that we are defining the computational pipelines (sequences of modules) to be executed.

  5. The A * B notation asks DSC to generate all possible sequences of modules from groups A and B; that is, all sequences of the form (a, b) where a is a module in group A and b is a module in group B. So, in this example, simulate * analyze * score generates all pipelines that consist of a module from the simulate group (normal or t), followed by a module from the analyze group (mean or median), and then a module from the score group (sq_err or abs_err). In this example, there are 2 × 2 × 2 = 8 different pipelines defined by this run statement, as enumerated in the sketch below.
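
To see the combinatorics spelled out, the following short R snippet (our own illustration, not something DSC runs) enumerates the same 8 combinations of modules that simulate * analyze * score produces:

# All 8 pipelines: the Cartesian product of the three module groups.
expand.grid(simulate = c("normal", "t"),
            analyze  = c("mean", "median"),
            score    = c("abs_err", "sq_err"))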

Executing the DSC benchmark

Up to this point, all we have is a bunch of code stored in a text file; the DSC file doesn’t do anything unless it is interpreted and executed by the dsc program.

The dsc program will run all 8 pipelines, keeping track of the values of all the script variables and pipeline variables generated in each pipeline, and it will store the values of all the module outputs, which we will retrieve for analysis in R.

To run the DSC benchmark, change the working directory to the location of the first_investigation_simpler.dsc file. (Here we assume the dsc repository is stored in the git subdirectory of your home directory. If you are running the example yourself, please move to the appropriate directory on your computer.)

cd ~/git/dsc/vignettes/one_sample_location

Let’s also remove any previously generated results.

rm -Rf first_investigation.html first_investigation.log first_investigation

To keep this example as simple as possible, we generate only a single replicate of each pipeline, and allow at most 2 modules to run in parallel at any one time:

dsc first_investigation_simpler.dsc --replicate 1 -c 2
INFO: DSC script exported to first_investigation.html
INFO: Constructing DSC from first_investigation_simpler.dsc ...
INFO: Building execution graph & running DSC ...
[#############################] 29 steps processed (29 jobs completed)
INFO: Building DSC database ...
INFO: DSC complete!
INFO: Elapsed time 6.206 seconds.

Inspecting the results of the DSC benchmark

To inspect the outcomes generated by each of the 8 pipelines, change the R working directory to the location of the DSC file, and use the dscquery function from the dscrutils package to load the DSC results into R:

setwd("~/git/dsc/vignettes/one_sample_location")
library(dscrutils)
dscout <-
  dscquery(dsc.outdir = "first_investigation",
           targets = c("simulate.true_mean","analyze.est_mean","score.error"))
dscout

Loading dsc-query output from CSV file.
Reading DSC outputs:
 - simulate.true_mean: extracted atomic values
 - analyze.est_mean: extracted atomic values
 - score.error: extracted atomic values
  DSC simulate simulate.true_mean analyze analyze.est_mean   score  score.error
1   1   normal                  0    mean        0.1088874 abs_err 1.088874e-01
2   1   normal                  0  median        0.1139092 abs_err 1.139092e-01
3   1        t                  3    mean        2.4245601 abs_err 5.754399e-01
4   1        t                  3  median        2.9918614 abs_err 8.138572e-03
5   1   normal                  0    mean        0.1088874  sq_err 1.185646e-02
6   1   normal                  0  median        0.1139092  sq_err 1.297530e-02
7   1        t                  3    mean        2.4245601  sq_err 3.311311e-01
8   1        t                  3  median        2.9918614  sq_err 6.623636e-05

The results in this table illustrate a few essential features of a DSC program:

  1. Looking at the “simulate”, “analyze” and “score” table columns, we can confirm that DSC has run 8 different pipelines. Each pipeline runs a different combination of the simulate, analyze and score modules.
  2. In DSC, each module input and output is assigned a different value within each pipeline. This is very different from most programs, in which each variable is assigned a single value. For example, score.error has 8 different values, one for each of the 8 pipelines. (We did not include the simulate.data module output in this table because it is too large to show here, but its value can be extracted like the other outputs.)

  3. Information flows between modules within the same pipeline. In the first pipeline (the first row of the table), for example, the error in the abs_err module (0.1088874) is calculated from (1) the value of the true mean, which was set to 0 in the normal module, and (2) the estimated mean, which was set to 0.1088874 in the mean module.
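
Once the results are loaded into a data frame, ordinary R tools can be used to summarize them further. For example, here is one way, using only base R and assuming the dscout data frame loaded above, to compare the errors of the two estimators under each error measure:

# Average error for each combination of analyze module and score module.
# (With a single replicate, the average is taken over the two simulation
# scenarios, normal and t.)
aggregate(score.error ~ analyze + score, data = dscout, FUN = mean)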

Pipeline evaluation, and alternatives

In the previous section, we observed that DSC generated 8 pipelines, in which each of the 8 pipelines is a different combination of modules and produces a different set of results. Now we explain at a high level how DSC produced these results.

  1. DSC runs the two simulate modules, normal and t, then stores the values assigned to $true_mean and $data. Therefore, the module outputs $true_mean and $data have two alternative values: the value assigned by running the normal module, and the value assigned by running the t module. (For example, the alternative values of $true_mean are 0 and 3.) normal and t are alternative modules in the simulate group. Therefore, although the normal module appears in 4 pipelines, and the t module also appears in 4 pipelines, each simulate module only needs to be run once.

  2. DSC runs the two analyze modules, mean and median, then stores the values assigned to the module output, $est_mean. The mean module is run twice, once for each alternative value of $data (the value of $data is assigned to module input x in the mean and median modules). Likewise, the median module is run twice, once for each alternative value of $data. Therefore, the analyze step is evaluated 4 times in total, and the module output $est_mean has 4 alternative values. If you look closely at the dscout data frame above, you will see that the analyze.est_mean column contains 4 unique values (see also the short check after this list).

  3. DSC runs the two score modules, sq_err and abs_err. These modules both accept two pipeline variables as input, $est_mean and $true_mean. Since there are 4 alternative joint assignments of $est_mean and $true_mean (one for each evaluation of the analyze step), each score module is evaluated 4 times, so in the end DSC stores 8 different values for the final module output, $error.
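
The counts given above are easy to verify from the dscout data frame loaded earlier (a small check of our own, not part of the DSC program):

# $est_mean takes 4 alternative values; $error takes 8.
length(unique(dscout$analyze.est_mean))  # expected: 4
length(unique(dscout$score.error))       # expected: 8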

A naive approach would have been to run the simulate step 8 times, the analyze step 8 times, and the score step 8 times, for a total of 24 module evaluations; many of these computations would be redundant. Instead, DSC performs the minimum amount of computation needed to generate the results for all the pipelines: here, 2 evaluations of the simulate step, 4 of the analyze step and 8 of the score step, or 14 in total. Although the wasted computation would have little noticeable effect on a small experiment such as this one, avoiding it can be very important when large data sets are being simulated and analyzed.

Note that, under the hood, the order of evaluation may not be exactly as we described it here—for example, the mean and median modules might be evaluated on the output from the normal module before the t module has been run—but the exact order of evaluation is unimportant for understanding how the DSC results are generated.

Recap

In this tutorial, we learned how the DSC file is used to define the key components of a DSC experiment:

  1. Module inputs and outputs (pipeline variables);

  2. Module scripts and script variables;

  3. Module groups; and

  4. Module sequences (pipelines).

We also learned:

  1. How information flows between modules executed in a pipeline;

  2. How values are assigned to pipeline variables separately in each pipeline;

  3. How modules are evaluated, and how alternative values of module outputs are stored.

Next steps

In Part II, we will extend this example and introduce a few other useful DSC features.