Release 1.0 roadmap
Unit tests
Module names
- Module name cannot start with `pipeline_`
- [?] module names have to be SQL friendly
- [!] modules that are identical in every way except names
- module names match number of executables
Parser
- [] More tests for @FILTER
- More tests for @ALIAS
- More tests for @CONF
- Module inheritance omitting exec
- No multiple module input `$x, $y` or any mixture of them with module parameters
- [!] All module output in the same block should have the same name (a required good practice)
- [] In `DSC::run` we support only `()`, `*` and `,`
- Module names / parameters cannot start / end with `_` and cannot have `.` in them
- Duplicate module / parameter names
- [] Different lengths of params in different exec, e.g. `mybeta = 1,2,3` vs. `mybeta = (4,5,6)`, will lead to a file lock failure.
- [] Strings intentionally ending with `,`
- [] Check if @ALIAS is of the right type (is it a number or str?)
- [] Check if all parameter names are of the right type
- `@FILTER` cannot have pipeline variables (`$`)
- Packages installed from GitHub also have a version option
Execution
- [] Identical tasks will result in a complaint; check for identical jobs, i.e. the same parameter twice
- [] Both in `DSC::run` and in `--target`: what if the first module has an upstream dependency? Should catch and report an error.
- [] Unsupported keywords in the `DSC` block
- [] Bad pipeline logic specification (resulting in failure as described in #22)
- [] Looped steps. Actually this should be a feature when desired …
- [] Downstream pipeline does not use any of the upstream variables
- [] All modules are valid (defined)
- Find some Rmd source code and test if they work as executables; the `mashr` intro works
Query
- `dsc-query`: strip the path for the `dsc_output` argument
Misc
- Do not write library installed files if installation fails
Documentation
Add these related discussions to the documentation
Best practices
- [] RE seeds – users should ensure seeds for modules are always the same, when applicable (see the sketch below).
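For instance, a module script can take its seed from a parameter instead of hard-coding or randomizing it, so modules meant to consume identical simulated data draw the same numbers. A minimal generic R sketch; `seed` and `n` are hypothetical module parameters, not DSC keywords:

```r
# Generic R sketch, not DSC-specific syntax: fix the seed from a parameter so
# that comparable modules see the same random draws.
set.seed(seed)   # same seed across comparable modules
x <- rnorm(n)    # the simulated data is now reproducible
```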
Examples
- Convert all previous examples to new syntax
- [] Add a tutorial that compares computation time / speed between modules
- [] Add a tutorial for benchmark output management, e.g. removing / rerunning specified steps and moving a project from one computer to another
- Explain how the DSC signature works
- [] Add a documentation page on remote execution, and a tutorial for it (`ash` example)
- Add documentation for `dscrutils`, and a tutorial on data extraction using the `one_sample_location` example; also update the `ash` example to include result exploration
Engineering
- [] Use HDF5 to replace msgpack and reimplement IO_DB to load only the necessary chunk (see the sketch after this list).
- [] Optimize `build_result_db` and `build_io_db` via line profiler.
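The point of the HDF5 switch is that a query can read only the slice of a dataset it needs, instead of deserializing an entire msgpack blob. A minimal sketch of that chunked-read idea using the Bioconductor `rhdf5` package; the file and dataset names are made up, and the real IO_DB code lives in DSC's Python layer:

```r
library(rhdf5)

# Hypothetical example file: write a dataset of one million rows once ...
h5createFile("io_db.h5")
h5write(matrix(rnorm(2e6), ncol = 2), "io_db.h5", "results")

# ... then later read back only the rows a query actually needs.
# `index` restricts the read to rows 1:100 and all columns, so the
# rest of the dataset is never loaded into memory.
chunk <- h5read("io_db.h5", "results", index = list(1:100, NULL))
dim(chunk)   # 100 x 2
```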
Small features
- [!] Add exec `depend` property to track source files
- Create a switch / or `--debug` switch to generate a mock run file
- [!] Pipeline seed batch has to be used / tested; based on the total number of jobs, distribute seeds smartly (see the sketch after this list).
- Remove old Rlib info files
- Fix GitHub R package arbitrary paths
- [!] Add tags for queries
- Support `(N,P):(100,1000)`
- [!] Drop support for nested tuples `((100,200),(300,400)), ((9,8))` – I have actually attempted to handle this in R
- [!] Properly handle grouped input, e.g. `g: (N,P)` (then the table has to have 2 columns `g_N` and `g_P`) – no need; just query with `"(N,P)"` as a string, otherwise users should use paired parameters
- `file()` / `file(ext)` behavior is reversed … need to fix
- Replace e.g. `simulate_n` with `simulate:n` in the column names
- [] Reimplement the underlying `args` and `sys.argv` for plugins – make them more general
- Support `Rmd` file as executable
- Rmd file chunk name
- Make operator `Python()` smarter, along the lines of the `R()` operator
- #98
- #99
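One way to read the seed-batch item above is to derive one seed per job from a single base seed, so a batch stays reproducible however it is split. A minimal R sketch under that assumption; `base_seed` and `n_jobs` are hypothetical names, and this is not the actual DSC mechanism:

```r
# Hypothetical sketch: derive one reproducible seed per job from a single
# base seed, then hand seeds[i] to job i.
make_job_seeds <- function(base_seed, n_jobs) {
  set.seed(base_seed)
  sample.int(.Machine$integer.max, n_jobs)  # distinct seeds, no replacement
}

seeds <- make_job_seeds(base_seed = 999, n_jobs = 50)
set.seed(seeds[7])  # job 7 always gets the same seed for the same base seed
```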
Major features
Many of these were existing features removed due to the new syntax, SoS advances, etc. We need to bring them back in.
Enhanced interface syntax
- `inline` executables
R interface
- Make operator `R()` smarter: basically make a `dscrutils::run_r()` function that runs an input string as R code and formats the result as a comma-separated string; raise an error if it fails to format (see the sketch after this list)
- [!] New data exchange format in HDF5
- Might stick to `rpy2` until we need to support Matlab
- New data extraction interface / basic data exploration features in R
- A more self-contained way to load DSC related functions: a companion R package eventually?
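A possible shape for that helper, sketched here only to pin down the intended behavior; the name `run_r` follows the item above, and none of this is the shipped `dscrutils` API:

```r
# Sketch: evaluate a string as R code and return the result as a
# comma-separated string, erroring out if the result cannot be flattened.
run_r <- function(code) {
  value <- eval(parse(text = code))
  if (!is.atomic(value) || length(value) == 0)
    stop("result of R() expression cannot be formatted as a comma-separated string")
  paste(value, collapse = ", ")
}

run_r("seq(0.1, 0.5, by = 0.1)")  # "0.1, 0.2, 0.3, 0.4, 0.5"
run_r("list(a = 1)")              # error: cannot be formatted
```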
Python interface
- Switch from RDS/rpy2 to HDF5
Shell command executable related issues
- [] Multiple output files
- [] Executable command options
  `exec` specifies the names of executable computational routines as well as their command-line arguments, if applicable. For example, an `exec` entry reads: `exec: datamaker.R, ms $nsam $nreps -t $theta -seed $seed`
- [] Index slicing
  Index for parameters, for example `exec: makeped.py $data $output[1]` where the `output` parameter takes the form `output: (1.ped, 1.map), (2.ped, 2.map)`. In this case `output[1]` will use only the first value of each parameter group.
Engineering
- Optimize performance / minimize overhead to the best of my knowledge
- This is never-ending – I mostly deal only with noticeable bottlenecks, and I use line profiler to identify the culprit.
Large scale computations
`--host` option
- To check: `scp`, `ssh` commands are available (see the sketch after this list)
- To sync by DSC:
  - the output folder
  - the host config file
- [!] CONF `merge` feature:
  `inline`: True or False, whether or not an R script is executed inline with the next procedure instead of producing return files. This feature is useful when the cost of computation for a procedure is trivial compared to the cost of storing its output. For example, if a simulation procedure is simply `runif(500000)`, it makes more sense to save this line of code and execute it inline with the next step, rather than to save a vector of 500,000 random numbers to disk.
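A minimal illustration of the "commands are available" check, written in R only for consistency with the other sketches; the actual check would live in DSC's Python internals, and nothing below is existing DSC code:

```r
# Hypothetical sketch: verify that remote-execution prerequisites exist on
# PATH before attempting any --host run.
required <- c("ssh", "scp")
found <- nzchar(Sys.which(required))  # Sys.which returns "" when not found
if (!all(found))
  stop("missing required command(s): ", paste(required[!found], collapse = ", "))
```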
Known issues
SoS issues that need to be fixed:
- Build mode not working
- `stderr` should remove empty files
- Hang on dead-lock
SoS issues that cannot be fixed:
- Multiple loads of I/O data for each task distribution