DSC Basics, Part II

This is the second part of the “DSC Basics” tutorial. Before working through this tutorial, you should have already read DSC Basics, Part I. Here we build on the mean estimation example from the previous part to illustrate new concepts and syntax in DSC, with an emphasis on the use of module parameters.

Materials used in this tutorial can be found in the DSC vignettes repository. As before, you may choose to run this example DSC program as you read through the tutorial, but this is not required. For more details, consult the README in the “one sample location” DSC vignette.

Adding a module parameter to normal

In our example DSC, recall we defined the normal module as follows:

normal: R(x <- rnorm(n = 100,mean = 0,sd = 1))
  $data: x
  $true_mean: 0

Here we propose to make a slight improvement to this module by adding a module parameter, n:

normal: R(x <- rnorm(n,mean = 0,sd = 1))
  n: 100
  $data: x
  $true_mean: 0

We have defined a module parameter n and set its value to 100. Once we have defined n, any of the R code may refer to this module parameter. In the R code, the first argument of rnorm is set to the value of n (which is 100).

In this first example, there is not much benefit to defining a module parameter n. In the examples below, the advantages of module parameters will become more apparent.

Adding a second module parameter to normal

In our current design of normal, we made an unfortunate choice: the mean used to simulate the data is defined twice, once inside the call to rnorm, where we set mean = 0, and once when we set the module output $true_mean to zero. If we decide to use a different mean to simulate the data, then we would have to be careful to change the code in two different places.

It would be better if the mean of the data were defined only once. This can be accomplished with a module parameter, which we will name mu (the Greek letter conventionally used to denote the mean):

normal: R(x <- rnorm(n,mean = mu,sd = 1))
  n: 100
  mu: 0
  $data: x
  $true_mean: mu

Here, we have defined a second module parameter, mu, and set its value to zero. The mean argument of rnorm can now be set to the value of mu.

Additionally, since mu is also made available as a script variable, the module output $true_mean can be set to the value of that variable. (In this example, the value of the module parameter happens to be the same as the value of the variable mu used in the R code, but in some cases the R code might modify mu, in which case the module parameter and the script variable will differ. So it is important to keep these quantities distinct.)

With this change to the module definition, modifying the mean used to simulate the data only requires editing one line of code instead of two.

Likewise, we can use a module parameter to specify the mean of the data simulated from a t distribution:

t: R(x <- mu + rt(n,df = 2))
  n: 100
  mu: 3
  $data: x
  $true_mean: mu

Note that there is no requirement that the module parameters for the normal and t modules have the same name, mu, but in this case it makes sense to do so. One advantage of defining parameters with the same name is that it makes it easier to query the results.
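For example, with the shared name mu, a single query target can retrieve the true mean regardless of which simulate module produced the data. Here is a sketch using the dscquery function from the dscrutils package (the target name simulate.mu assumes the two modules are grouped as simulate, as in Part I, and that the benchmark results are stored in first_investigation):

```r
# Hypothetical query: because both simulate modules define a parameter
# named "mu", the single target "simulate.mu" covers the normal and t
# modules alike.
library(dscrutils)
dscout <- dscquery(dsc.outdir = "first_investigation",
                   targets = c("simulate.mu","analyze","score.error"))
```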

The order of evaluation inside a module

In the examples above, we informally introduced the notion of a module parameter. Below, we will give some more elaborate examples with module parameters, so here we take a moment to describe more formally how a module parameter behaves in relation to the other components of a module:

  • A module parameter cannot depend on any of the module inputs, and it can only depend on other module parameters previously specified.

  • Module parameters are evaluated before the module script. The exact procedure for evaluating a module is as follows:

    1. Evaluate any R code used to determine the values of the module parameters (we give an example of this below).

    2. Set the values of the module parameters.

    3. Initialize the module inputs according to the current stored values of the pipeline variables.

    4. For each module parameter and module input, define a script variable, with the same name and value as the module parameter or input, in the global environment in which the script is evaluated.

    5. Evaluate the module script or inline source code. All script variables are retained for resolving any module outputs.

    6. Evaluate the expressions used to determine the values of the module outputs.

To illustrate the evaluation procedure, consider the simple example of a module we gave above, repeated here for convenience:

normal: R(x <- rnorm(n,mean = mu,sd = 1))
  n: 100
  mu: 0
  $data: x
  $true_mean: mu

The steps in evaluating the module are as follows:

  1. The first step is skipped because no R code needs to be run to evaluate the module parameters.

  2. The module parameters n and mu are set to 100 and 0, respectively.

  3. The third step is skipped because no inputs are defined for this module.

  4. In the global R environment, variables n and mu are defined and set to the values of the module parameters n and mu (in this case, 100 and 0).

  5. The expression x <- rnorm(n,mean = mu,sd = 1) is parsed and evaluated in the same global R environment.

  6. Module output $data is assigned the value of the R expression x (which is a vector of length 100), and module output $true_mean is assigned the value of the R expression mu (which is just zero because mu was unchanged when the R code was run). In both cases, the R expressions are simply variable names, but we point out that they can be more general R expressions.

We will refer back to this evaluation procedure in other examples below.

A single module parameter with multiple alternative values

Above, we gave a couple examples of defining module parameters. Here, we will demonstrate an important feature of module parameters: they can be used to define multiple modules that are similar to each other.

Our current definition of the normal module simulates 100 random samples from a normal distribution. Suppose we would like to define a second module that simulates 1,000 random samples from the same normal distribution. This is easily done by allowing the module parameter n to take on two different values:

normal: R(x <- rnorm(n,mean = mu,sd = 1))
  n: 100, 1000
  mu: 0
  $data: x
  $true_mean: mu

The comma delimits the two possible values of module parameter n.

Now that we have defined n inside this module, we can refer to this module parameter inside the R code that simulates random draws from a normal distribution, as in the example above.

To be precise, this code defines a module block with two modules. It is equivalent to defining two modules, normal_100 and normal_1000, that are identical in every way except that the first module includes the parameter definition n: 100 and the second defines n: 1000. The module block above is of course much more succinct.
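Written out in full, these two equivalent modules would look like this (the names normal_100 and normal_1000 are used for illustration only):

```
normal_100: R(x <- rnorm(n,mean = mu,sd = 1))
  n: 100
  mu: 0
  $data: x
  $true_mean: mu

normal_1000: R(x <- rnorm(n,mean = mu,sd = 1))
  n: 1000
  mu: 0
  $data: x
  $true_mean: mu
```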

The line n: 100, 1000 should not be interpreted as defining a vector or sequence with two entries, 100 and 1000. It defines a set of alternative values. To put it another way—and this is the terminology we use frequently—n: 100, 1000 defines two alternative values for module parameter n, and therefore defines two alternative modules that are the same in every way (including their name, normal) except for the setting of n.

An important property of module parameters with multiple alternative values is that their order does not matter. For example, if we instead wrote n: 1000, 100, the DSC results will be exactly the same as n: 100, 1000. The only thing that will change is the order in which the results will appear in the tables, and the way in which the results are stored in files.

Although the two modules both have the same name, normal, their outputs can still be easily distinguished in the results; for example, if you want to compare the accuracy of the estimates in the larger (n = 1000) and smaller (n = 100) simulated data sets, the results from these two modules can be distinguished by the stored value of the module parameter n. We will see an example of this next.

Executing the DSC with four simulate modules

Let’s go ahead and generate results from our new “mean estimation” DSC. In the new DSC, the simulate modules are defined by two module blocks:

normal: R(x <- rnorm(n,mean = mu,sd = 1))
  n: 100, 1000
  mu: 0
  $data: x
  $true_mean: mu

t: R(x <- mu + rt(n,df = 2))
  n: 100, 1000
  mu: 3
  $data: x
  $true_mean: mu

The rest of the DSC remains unchanged from before.

This new DSC is implemented by simulate_data_twice.dsc inside the one_sample_location vignette folder.

To run the DSC benchmark, first change the working directory (here we have assumed that the dsc repository is stored in the git subdirectory of your home directory),

cd ~/git/dsc/vignettes/one_sample_location

remove any previously generated results,

rm -Rf first_investigation.html first_investigation.log first_investigation

then run 10 replicates of all the pipelines:

dsc simulate_data_twice.dsc --replicate 10
INFO: DSC script exported to first_investigation.html
INFO: Constructing DSC from simulate_data_twice.dsc ...
INFO: Building execution graph & running DSC ...
[#############################] 29 steps processed (295 jobs completed)
INFO: Building DSC database ...
INFO: DSC complete!
INFO: Elapsed time 29.389 seconds.

Comparing the DSC results with n=100 and n=1000

Now let’s inspect the DSC results in R. Change the R working directory to the location of the DSC file, and use the dscquery function from the dscrutils package to load the DSC results into R:

setwd("~/git/dsc/vignettes/one_sample_location")
library(dscrutils)
dscout <-
  dscquery(dsc.outdir = "first_investigation",
           targets = c("simulate.n","analyze","score.error"))
nrow(dscout)

Loading dsc-query output from CSV file.
Reading DSC outputs:
 - score.error: extracted atomic values
160

The DSC command we ran above generated results for 10 replicates of 16 pipelines. We now have double the number of pipelines we had before, which is expected because we now have 4 simulate modules (2 normal modules and 2 t modules), whereas before we had 2 simulate modules. To confirm this, we see that each of the four simulate modules appears in 40 pipelines:

with(dscout,table(simulate,simulate.n))

        simulate.n
simulate 100 1000
  normal  40   40
  t       40   40

The "simulate.n" column in the dscout data frame gives the value of module parameter n defined inside a simulate module.

We would expect estimates of the population mean to improve with more data. We can quickly check this by comparing the average squared error in the pipelines with 100 samples against that in the pipelines with 1,000 samples:

dat <- subset(dscout,score == "sq_err")
as.table(by(dat,
            with(dat,list(analyze,simulate.n)),
            function (x) mean(x$score.error)))

               100        1000
mean   0.032322738 0.006269221
median 0.018195606 0.001767654

Indeed, based on the results from these 10 replicates, we observe that the accuracy of both methods (mean and median) improves considerably with more data, and that for both sample sizes the median is, on average, more accurate than the mean.

Two module parameters with multiple alternatives

If you provide more than one value for multiple module parameters, DSC considers all combinations of the values.

For example, suppose we want to evaluate estimators of the population mean when the data are simulated from the t distribution with different numbers of degrees of freedom. In DSC, this can be succinctly expressed by defining another module parameter, df, with multiple values:

t: R(x <- mu + rt(n,df))
  n: 100, 1000
  mu: 3
  df: 2, 4, 10
  $data: x
  $true_mean: mu

This new module block defines 6 t modules from the 6 different ways of setting both the n and df parameters.

Let’s clear the previous results and run the DSC benchmark with this new module:

rm -Rf first_investigation.html first_investigation.log first_investigation
dsc simulate_multiple_dfs.dsc --replicate 10
INFO: DSC script exported to first_investigation.html
INFO: Constructing DSC from simulate_multiple_dfs.dsc ...
INFO: Building execution graph & running DSC ...
[#############################] 29 steps processed (575 jobs completed)
INFO: Building DSC database ...
INFO: DSC complete!
INFO: Elapsed time 57.557 seconds.

Now let’s load all the results generated with the t-simulated data:

dscout2 <-
  dscquery(dsc.outdir = "first_investigation",
           targets = c("t.n","t.df","analyze","score.error"))

Loading dsc-query output from CSV file.
Reading DSC outputs:
 - score.error: extracted atomic values

In total, we have results from 240 pipelines, which we can break down as follows:

with(dscout2,table(t.n,t.df))

      t.df
t.n     2  4 10
  100  40 40 40
  1000 40 40 40

Each of the six t modules appears in 40 rows of the results: 2 analyze modules x 2 score modules x 10 replicates.

Using a module parameter to set the seed

Here we illustrate another practical use of module parameters for defining modules with different “seeds”.

To ensure reproducible results, it is often necessary to initialize the state, or seed, of the pseudorandom number generator. For example, in R the sequence of pseudorandom numbers is initialized by calling set.seed(x), in which x is an integer. (Note that DSC provides a default seed setting in R.)
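As a quick illustration in plain R (independent of DSC), re-initializing the generator with the same seed reproduces the same draws:

```r
set.seed(1)
a <- rnorm(3)
set.seed(1)      # re-initialize with the same seed
b <- rnorm(3)
identical(a, b)  # TRUE: the same pseudorandom draws are produced
```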

For example, we can define 10 modules that generate 10 normally distributed data sets:

normal: R(set.seed(seed); x <- rnorm(n,mean = mu,sd = 1))
  seed: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
  n: 100
  mu: 0
  $data: x
  $true_mean: mu

The only difference in the 10 normal modules is the sequence of pseudorandom numbers used to simulate random draws from the normal distribution.

See multiple_seeds.dsc in the one_sample_location directory for the working example.

Combining module parameters with module inputs

It is also possible to combine module parameters with module inputs.

Recall that in the first part of this tutorial we defined a third analyze module that implemented the “Winsorized” mean. The trim argument to winsor.mean determines what proportion of the data to “squish” from the top and bottom of the distribution.

Suppose we wanted to assess the impact of the trimming on the estimate accuracy. To do this, we could introduce a module parameter trim with multiple settings:

winsor: R(y <- psych::winsor.mean(x,trim,na.rm = TRUE))
  trim: 0.1, 0.2
  x: $data
  $est_mean: y

Each time a data set $data is generated by a simulate module, DSC will run two different winsor modules: one with trim = 0.1, and a second with trim = 0.2.

Intuitively, one may want to adjust the trim setting dynamically based on the data (e.g., based on the fit to the normal distribution). However, this is not possible in DSC because module parameters must be set independently of the module inputs.
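To make the restriction concrete, a definition along these lines is not valid DSC, because the module parameter would depend on a module input (this fragment is shown only to illustrate what is disallowed):

```
winsor: R(y <- psych::winsor.mean(x,trim,na.rm = TRUE))
  x: $data
  trim: x      # not allowed: a module parameter cannot depend on an input
  $est_mean: y
```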

More complex module parameters

Above, we illustrated the use of module parameters to succinctly define multiple similar modules. All the above examples used module parameters with simple (“atomic”) values. Here we illustrate defining module parameters in DSC with more complex values. We focus on the case when the module parameter is a vector of length 2. Although this example is still relatively simple, it is meant only to illustrate the capabilities of DSC—the features introduced in this section can be used to define much more complex module parameters than the ones illustrated in this example. [TO DO: Add link here to appropriate place in the DSC Reference Manual talking about syntax for module parameters.]

Suppose, for example, we would like to simulate data sets from the t distribution with different numbers of degrees of freedom and different variances. This can easily be done by defining two module parameters with multiple alternative settings:

t: R(x <- mu + sd * rt(n,df))
  n: 100, 1000
  mu: 0
  sd: 0.2, 0.5, 1
  df: 4, 6, 8, 10
  $data: x
  $true_mean: mu

DSC will define a module for each combination of the module settings; specifically, this module block defines $2 \times 3 \times 4 = 24$ modules, one for each combination of n (the number of samples), sd (the standard deviation) and df (the degrees of freedom).
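We can double-check this count in R: expand.grid enumerates all combinations of the parameter values, much as DSC does.

```r
# Enumerate every combination of n, sd and df, as DSC would.
combos <- expand.grid(n  = c(100, 1000),
                      sd = c(0.2, 0.5, 1),
                      df = c(4, 6, 8, 10))
nrow(combos)  # 24 combinations, hence 24 modules
```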

Defining vector-valued module parameters in DSC

Now suppose that we want to have finer control over which combinations of sd and df are used to simulate the data; in other words, we do not want to simulate data sets for all combinations of sd and df. For example, suppose we want to simulate data from t distributions with these three settings of the standard deviation (sd) and degrees of freedom (df):

 sd   df
0.5    4
0.6    6
0.8   10

One way to accomplish this is to define a new module parameter that stores the settings for both the standard deviation and the degrees of freedom. DSC permits the use of parentheses to define vector-valued module parameters:

t: R(x <- mu + params[1] * rt(n,df = params[2]))
  n: 100, 1000
  mu: 0
  params: (0.5, 4), (0.6, 6), (0.8, 10)
  $data: x
  $true_mean: mu

This module block defines $2 \times 3 = 6$ modules, one for each combination of n and params; each setting of params is a vector of length 2, in which the first vector entry specifies the standard deviation parameter, and the second entry specifies the degrees of freedom.

Defining module parameters with un-quoted code strings

Another option for defining a module parameter is to input chunks of code directly. For this example, this is the best option in terms of succinctness and readability, while preserving the full input information (the sd and df names). Here is the new t block using un-quoted code strings:

t: R(x <- mu + params["sd"] * rt(n,df = params["df"]))
  n: 100, 1000
  mu: 0
  params: c(sd = 0.5, df = 4),
          c(sd = 0.6, df = 6),
          c(sd = 0.8, df = 10)
  $data: x
  $true_mean: mu

As before, this module block defines $2 \times 3 = 6$ modules, one for each combination of n and params. The end result is the same as the above. However, there are some fundamental differences in this version:

  • In each module, the params module parameter is not a vector of length 2; rather, it is a string containing the specified R expression, e.g., "c(sd = 0.5, df = 4)".

  • In each module, the script variable params is assigned the value of this expression, evaluated in the R environment. In fact, this is implemented by prepending the R code to the code inside the R() statement. Therefore, this module block is (for the most part) equivalent to the following three module blocks, in which the only difference between the modules is the params assignment inside the R code:

t1: R(params <- c(sd = 0.5, df = 4); x <- mu + params["sd"] * rt(n,df = params["df"]))
  n: 100, 1000
  mu: 0
  $data: x
  $true_mean: mu

t2: R(params <- c(sd = 0.6, df = 6); x <- mu + params["sd"] * rt(n,df = params["df"]))
  n: 100, 1000
  mu: 0
  $data: x
  $true_mean: mu

t3: R(params <- c(sd = 0.8, df = 10); x <- mu + params["sd"] * rt(n,df = params["df"]))
  n: 100, 1000
  mu: 0
  $data: x
  $true_mean: mu

In this example, using un-quoted code strings provides several advantages:

  1. The code remains succinct.

  2. As before, the vector entries are labeled by sd and df, making the code more understandable.

  3. Because the code is evaluated in the same R environment as the module script, the labels are retained, and so we can write params["sd"] and params["df"] instead of params[1] and params[2] to access the settings of the standard deviation and the degrees of freedom.
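To see what this label retention means in plain R, a named vector can be indexed by name as well as by position:

```r
params <- c(sd = 0.5, df = 4)
params["sd"]  # access by label ...
params[[1]]   # ... equivalent to access by position
```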

Defining module parameters with R()

This example achieves our aim of directly specifying combinations of sd and df, but the params module parameter is now more difficult to work with when querying the results: to match it in a query, one has to type the exact code string, e.g., "c(sd = 0.5, df = 4)", including spaces.

Instead of using code strings, we can use the R() operator to generate vector-valued inputs while maintaining readability:

t: R(x <- mu + params[1] * rt(n,df = params[2]))
  n: 100, 1000
  mu: 0
  params: R(c(sd = 0.5, df = 4)),
          R(c(sd = 0.6, df = 6)),
          R(c(sd = 0.8, df = 10))
  $data: x
  $true_mean: mu

In the “params: ...” statement, the code inside each R(...) is evaluated in an R environment, and the outcome of each evaluation is used to assign a value to each setting of params. This achieves the same result as the vector-valued inputs above, with a slight improvement: because the vector entries are labeled sd and df, the intent of the first and second entries is clearer from the code.

In general, R() provides much more flexibility in defining complex module parameters because it provides access to the many features of R. (The only restriction is that none of the module inputs can be referenced within the R code.)

Defining module parameters with R{}

A third option for defining module parameters with R code is to use R{}:

t: R(x <- mu + params[1] * rt(n,df = params[2]))
  n: 100, 1000
  mu: 0
  params: R{list(c(sd = 0.5, df = 4),
                 c(sd = 0.6, df = 6),
                 c(sd = 0.8, df = 10))}
  $data: x
  $true_mean: mu

The R{} statement evaluates the code given inside the curly braces, then subdivides the elements of the top-level sequence (vector or list) among multiple settings of the module parameters. In other words, a single R{} statement defines a set of alternatives. In this example, the R code evaluates to a list with 3 elements, in which each list element is a vector of length 2. Therefore, params is assigned 3 alternative values, in which each value is a vector of length 2.

In this particular example, defining the module parameter with R() or R{} achieves the same result, and either way the code is equally understandable, so there is no particular advantage of one syntax over the other. The main utility of R{} is that it can succinctly define longer sets of alternative settings. For example, in the random seed example above, creating a module parameter that defines 10 different seeds was rather tedious. Here is the same example using R{}:

normal: R(set.seed(seed); x <- rnorm(n,mean = mu,sd = 1))
  seed: R{1:10}
  n: 100
  mu: 0
  $data: x
  $true_mean: mu

Recap

In this tutorial, we showed how module parameters can be used to succinctly define multiple similar modules.

The multiple similar modules defined in a module block can be identified according to the name of the module and the module parameter settings.

The keywords R() and R{} are useful for defining more complex-valued module parameters.

Exploring further

In this tutorial, we introduced the features of DSC that are most essential to developing your own DSC benchmark. There are many other features of DSC that we did not have a chance to explore in these introductory tutorials—to learn more, visit…