Release 1.0 roadmap
Unit tests
Module names
- Module name cannot start with `pipeline_`
- [?] module names have to be SQL friendly
- [!] modules that are identical in every way except names
- module names match number of executables
Parser
- [] More tests for @FILTER
- More tests for @ALIAS
- More tests for @CONF
- Module inheritance omitting exec
- No multiple module input `$x, $y` or any mixture of them with module parameters
- [!] All module output in the same block should have the same name (a required good practice)
- [] In `DSC::run` we support only `()`, `*` and `,`
- Module names / parameters cannot start / end with `_` and cannot have `.` in them
- Duplicate module / parameter names
- [] Different lengths of params in different exec, e.g. `mybeta = 1,2,3` vs. `mybeta = (4,5,6)`, will lead to a file lock failure.
- [] Strings intentionally ending with `,`
- [] Check if @ALIAS is of the right type (is it a number or str?)
- [] Check if all parameter names are of the right type
- `@FILTER` cannot have pipeline variables (`$`)
- Packages installed from GitHub also have a version option
Execution
- [] Identical tasks will result in a complaint; check for identical jobs, i.e. the same parameter twice
- [] Both in `DSC::run` and in `--target`: what if the first module has an upstream dependency? Should catch and report an error.
- [] Unsupported keywords in the `DSC` block
- [] Bad pipeline logic specification (resulting in failure as described in #22)
- [] Looped steps. Actually this should be a feature when desired …
- [] Downstream pipeline does not use any of the upstream variables
- [] All modules are valid (defined)
- Find some Rmd source code and test if they work as executables; the `mashr` intro works
Query
- `dsc-query`: strip the path for the `dsc_output` argument
Misc
- Do not write library installed files if installation fails
Documentation
Add these related discussions to the documentation
Best practices
- [] RE seeds – users should ensure seeds for modules are always the same, when applicable (see the sketch below).
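For instance, a module script can take its seed from a parameter instead of hard-coding or randomizing it, so modules meant to consume identical simulated data draw the same numbers. A minimal generic R sketch; `seed` and `n` are hypothetical module parameters, not DSC keywords:

```r
# Generic R sketch, not DSC-specific syntax: fix the seed from a parameter so
# that comparable modules see the same random draws.
set.seed(seed)   # same seed across comparable modules
x <- rnorm(n)    # the simulated data is now reproducible
```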
Examples
- Convert all previous examples to new syntax
- [] Add a tutorial that compares computation time / speed between modules
- [] Add a tutorial for benchmark output management, e.g. removing / rerunning specified steps and moving a project from one computer to another
- Explain how the DSC signature works
- [] Add a documentation page on remote execution, and a tutorial for it (`ash` example)
- Add documentation for `dscrutils`, and a tutorial on data extraction using the `one_sample_location` example; also update the `ash` example to include result exploration
Engineering
- [] Use HDF5 to replace msgpack and reimplement IO_DB to load only the necessary chunk (see the sketch after this list).
- [] Optimize `build_result_db` and `build_io_db` via line profiler.
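The point of the HDF5 switch is that a query can read only the slice of a dataset it needs, instead of deserializing an entire msgpack blob. A minimal sketch of that chunked-read idea using the Bioconductor `rhdf5` package; the file and dataset names are made up, and the real IO_DB code lives in DSC's Python layer:

```r
library(rhdf5)

# Hypothetical example file: write a dataset of one million rows once ...
h5createFile("io_db.h5")
h5write(matrix(rnorm(2e6), ncol = 2), "io_db.h5", "results")

# ... then later read back only the rows a query actually needs.
# `index` restricts the read to rows 1:100 and all columns, so the
# rest of the dataset is never loaded into memory.
chunk <- h5read("io_db.h5", "results", index = list(1:100, NULL))
dim(chunk)   # 100 x 2
```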
Small features
- [!] Add exec `depend` property to track source files
- Create a switch / or `--debug` switch to generate a mock run file
- [!] Pipeline seed batch has to be used / tested; based on the total number of jobs, distribute seeds smartly (see the sketch after this list).
- Remove old Rlib info files
- Fix GitHub R package arbitrary paths
- [!] Add tags for queries
- Support `(N,P):(100,1000)`
- [!] Drop support for nested tuples `((100,200),(300,400)), ((9,8))` – I have actually attempted to handle this in R
- [!] Properly handle grouped input, e.g. `g: (N,P)` (then the table has to have 2 columns `g_N` and `g_P`) – no need; just query with `"(N,P)"` as a string, otherwise users should use paired parameters
- `file()` / `file(ext)` behavior is reversed … need to fix
- Replace e.g. `simulate_n` with `simulate:n` in the column names
- [] Reimplement the underlying `args` and `sys.argv` for plugins – make them more general
- Support `Rmd` file as executable
- Rmd file chunk name
- Make operator `Python()` smarter, along the lines of the `R()` operator
- #98
- #99
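One way to read the seed-batch item above is to derive one seed per job from a single base seed, so a batch stays reproducible however it is split. A minimal R sketch under that assumption; `base_seed` and `n_jobs` are hypothetical names, and this is not the actual DSC mechanism:

```r
# Hypothetical sketch: derive one reproducible seed per job from a single
# base seed, then hand seeds[i] to job i.
make_job_seeds <- function(base_seed, n_jobs) {
  set.seed(base_seed)
  sample.int(.Machine$integer.max, n_jobs)  # distinct seeds, no replacement
}

seeds <- make_job_seeds(base_seed = 999, n_jobs = 50)
set.seed(seeds[7])  # job 7 always gets the same seed for the same base seed
```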
Major features
Many of these were existing features removed due to the new syntax, SoS advances, etc. We need to bring them back in.
Enhanced interface syntax
- `inline` executables
R interface
- Make operator `R()` smarter: basically make a `dscrutils::run_r()` function that runs an input string as R code and formats the result as a comma-separated string; raise an error if it fails to format (see the sketch after this list)
- [!] New data exchange format in HDF5
- Might stick to `rpy2` until we need to support Matlab
- New data extraction interface / basic data exploration features in R
- A more self-contained way to load DSC related functions: a companion R package eventually?
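A possible shape for that helper, sketched here only to pin down the intended behavior; the name `run_r` follows the item above, and none of this is the shipped `dscrutils` API:

```r
# Sketch: evaluate a string as R code and return the result as a
# comma-separated string, erroring out if the result cannot be flattened.
run_r <- function(code) {
  value <- eval(parse(text = code))
  if (!is.atomic(value) || length(value) == 0)
    stop("result of R() expression cannot be formatted as a comma-separated string")
  paste(value, collapse = ", ")
}

run_r("seq(0.1, 0.5, by = 0.1)")  # "0.1, 0.2, 0.3, 0.4, 0.5"
run_r("list(a = 1)")              # error: cannot be formatted
```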
Python interface
- Switch from RDS/rpy2 to HDF5
Shell command executable related issues
- [] Multiple output files
- [] Executable command options
  `exec` specifies the names of executable computational routines as well as their command-line arguments, if applicable. For example, an `exec` entry reads: `exec: datamaker.R, ms $nsam $nreps -t $theta -seed $seed`
- [] Index slicing
  Index for parameters, for example `exec: makeped.py $data $output[1]` where the `output` parameter takes the form `output: (1.ped, 1.map), (2.ped, 2.map)`. In this case `output[1]` will use only the first value of each parameter group.
Engineering
- Optimize performance / minimize overhead to the best of my knowledge
- This is never-ending – I mostly deal only with noticeable bottlenecks, and I use line profiler to identify the culprit.
Large scale computations
`--host` option
- To check: `scp`, `ssh` commands are available (see the sketch after this list)
- To sync by DSC:
  - the output folder
  - the host config file
- [!] CONF `merge` feature:
  `inline`: True or False, whether or not an R script is executed inline with the next procedure instead of producing return files. This feature is useful when the cost of computation for a procedure is trivial compared to the cost of storing its output. For example, if a simulation procedure is simply `runif(500000)`, it makes more sense to save this line of code and execute it inline with the next step, rather than to save a vector of 500,000 random numbers to disk.
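A minimal illustration of the "commands are available" check, written in R only for consistency with the other sketches; the actual check would live in DSC's Python internals, and nothing below is existing DSC code:

```r
# Hypothetical sketch: verify that remote-execution prerequisites exist on
# PATH before attempting any --host run.
required <- c("ssh", "scp")
found <- nzchar(Sys.which(required))  # Sys.which returns "" when not found
if (!all(found))
  stop("missing required command(s): ", paste(required[!found], collapse = ", "))
```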
Known issues
SoS issues that need to be fixed:
- Build mode not working
- `stderr` should remove empty files
- Hang on dead-lock
SoS issues that cannot be fixed:
- Multiple loads of I/O data for each task distribution