Multiple DSC pipelines

In this tutorial we further expand benchmark of the DSC problem described in DSC Introduction, to demonstrate the use of multiple pipelines (pipeline ensembles) in DSC. Material used in this document can be found in DSC vignettes repo.

Configuration

The DSC problem is similar to what we have previously worked on, i.e. comparison of location parameter estimation methods. This time we simulate data under t distribution (df = 2) and Cauchy distribution. Then before estimating location parameter using mean or median method, there is an optional transform step where we provide two methods for Winsorization. This results in two DSC pipelines:

  • simulate -> estimate -> score
  • simulate -> transform -> estimate -> score

The DSC problem is fully specified as:

#!/usr/bin/env dsc

normal: normal.R
  n: 100
  $data: x
  $true_mean: 0

t: t.R
  n: 100
  df: 2
  $data: x
  $true_mean: 3

winsor1, winsor2: winsor1.R, winsor2.R
    x: $data
    @winsor1:
      fraction: 0.05
    @winsor2:
      multiple: 3
    $data: x

mean: mean.R
  x: $data
  $est_mean: y

median: median.R
  x: $data
  $est_mean: y

sq_err: sq.R
  a: $est_mean
  b: $true_mean
  $error: e
 
abs_err: abs.R
  a: $est_mean
  b: $true_mean
  $error: e 
  
DSC:
    define:
      simulate: normal, t
      transform: winsor1, winsor2
      analyze: mean, median
      score: abs_err, sq_err
    run: simulate * (analyze, transform * analyze) * score
    exec_path: R
    output: dsc_result

where transform module ensemble contains:


  ==> ../vignettes/one_sample_location_winsor/R/winsor1.R <==
  ##  replace the extreme values with limits
  winsor1 <- function (x, fraction=.05)
  {
     if(length(fraction) != 1 || fraction < 0 ||
           fraction > 0.5) {
        stop("bad value for 'fraction'")
     }
     lim <- quantile(x, probs=c(fraction, 1-fraction))
     x[ x < lim[1] ] <- lim[1]
     x[ x > lim[2] ] <- lim[2]
     return(x)
  }
  x = winsor1(x, fraction)

  ==> ../vignettes/one_sample_location_winsor/R/winsor2.R <==
  ## move the datapoints that are x times the absolute deviations from mean
  winsor2 <- function (x, multiple=3)
  {
     if(length(multiple) != 1 || multiple <= 0) {
        stop("bad value for 'multiple'")
     }
     med <- median(x)
     y <- x - med
     sc <- mad(y, center=0) * multiple
     y[ y > sc ] <- sc
     y[ y < -sc ] <- -sc
     return(y + med)
  }
  x = winsor2(x, multiple)

As a result, the previous analyze step is now a pipeline ensemble of (transform * analyze, analyze).

Execution

To run the benchmark

cd ~/GIT/dsc/vignettes/one_sample_location_winsor
./settings.dsc -c 30
INFO: Checking R library dscrutils@stephenslab/dsc/dscrutils ...
INFO: DSC script exported to dsc_result.html
INFO: Constructing DSC from ./settings.dsc ...
INFO: Building execution graph & running DSC ...
[#####################################################################################] 85 steps processed (85 jobs completed)
INFO: Building DSC database ...
INFO: DSC complete!
INFO: Elapsed time 9.494 seconds.