Running DSC on a computer cluster
Here we illustrate how DSC can facilitate remote computation. This is useful when working with a DSC that involves intensive and/or long-running computations; many of these computations are amenable to parallel or distributed computation, so we would like to take advantage of a high-performance computing system to speed up running a benchmark.
Prerequisites
We will use a simple example to demonstrate how to run DSC on a remote computing system. Although this example is not particularly well-suited for running on a high-performance compute cluster—it can be easily run on a standard desktop computer, and the individual computing tasks are very short—we will use this example to illustrate the features of DSC for facilitating remote computation. We will draw from the example presented in DSC basics, part I. The source code can be found here.
Our aim is to submit it to a remote cluster to generate all the results. Before doing so, we should make sure it works.
First, run the “truncate” version of the DSC command to test the script (see here for details):
dsc first_investigation_simpler.dsc --truncate
If the “truncated run” reports an error, debug the DSC and test again until the truncated run works.
Now we are ready to submit the DSC to run on a computer cluster.
Overview of running remote jobs
DSC uses the --host option to specify a configuration file for setting up computing environments, including HPC clusters. With a proper configuration, DSC will automatically submit jobs to the job scheduling system on the HPC to run the benchmark. Under the hood, DSC configures all modules and converts them to SoS tasks. Interested readers may refer to this page for implementation details.
Basic configuration example
The DSC command option --host accepts a YAML configuration file that specifies a template for remote jobs. We begin with an example illustrating the essential elements of a configuration file. This example was configured for the high-performance compute cluster run out of the University of Chicago (“midway”), so we have called this file “midway.yml”. To use this configuration file to run your benchmark (called “mybenchmark.dsc”), you would log in to the remote computing system and run the following command: dsc mybenchmark.dsc --host midway.yml.
DSC:
  midway2:
    queue_type: pbs
    status_check_interval: 120
    max_running_jobs: 10
    task_template: |
      #!/bin/bash
      #SBATCH --time=6:00:00
      #SBATCH --partition=broadwl
      #SBATCH --mem=4G
      #SBATCH --exclusive
    submit_cmd: sbatch {job_file}
    submit_cmd_output: "Submitted batch job {job_id}"
    status_cmd: squeue --job {job_id}
    kill_cmd: scancel {job_id}
Let’s walk through this configuration file step-by-step:
- DSC is the required section title for specifying host configurations.
- midway2 is the label assigned to the configuration.
- status_check_interval: 120 means that DSC checks the job status every 120 seconds (2 minutes), and takes action if the job has failed or completed. It is important to set this interval to a reasonable period so as not to overload the job scheduling service when it is shared with many other users.
- max_running_jobs: 10 is the maximum number of jobs that will run at any one time. Again, it is important not to set this too large on a system that is shared with other users.
- The task_template entry gives the initial steps that are run to set up the computing environment for computation in DSC. In this case, it is a basic bash script with additional instructions for the Slurm job scheduler. The amount of memory and the time limit in this template are hard-coded for simplicity, but we will demonstrate later how to adjust them for different modules. In this simple example, the time limit of 6 hours per submitted job is probably a vast overestimate of the actual time needed.
- The submit_cmd, submit_cmd_output, status_cmd and kill_cmd entries give additional instructions on how DSC should interact with the job management system (in this case, the Slurm job management system).
This basic configuration file is sufficient to run DSC computations on a system that uses Slurm; compute systems that use different job management software (e.g., TORQUE) can be configured similarly.
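As a rough sketch, and assuming the same configuration keys shown above, a TORQUE-based system might be set up along the following lines. The host label, queue name, resource requests and scheduler commands here are placeholders; consult your cluster documentation and the DSC/SoS documentation for the exact settings your system requires.
DSC:
  mycluster:
    queue_type: pbs
    status_check_interval: 120
    max_running_jobs: 10
    task_template: |
      #!/bin/bash
      #PBS -l walltime=6:00:00
      #PBS -l mem=4gb
      #PBS -q batch
    submit_cmd: qsub {job_file}
    submit_cmd_output: "{job_id}"
    status_cmd: qstat {job_id}
    kill_cmd: qdel {job_id}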
More advanced configuration example
Here we provide a more advanced template for use with Slurm on the UChicago RCC. It configures multiple queues, midway2 and stephenslab.
DSC:
  midway2:
    description: UChicago RCC cluster Midway 2
    queue_type: pbs
    status_check_interval: 30
    max_running_jobs: 30
    max_cores: 40
    max_walltime: "36:00:00"
    max_mem: 64G
    task_template: |
      #!/bin/bash
      #{partition}
      #{account}
      #SBATCH --time={walltime}
      #SBATCH --nodes={nodes}
      #SBATCH --cpus-per-task={cores}
      #SBATCH --mem={mem//10**9}G
      #SBATCH --job-name={job_name}
      #SBATCH --output={cur_dir}/{job_name}.out
      #SBATCH --error={cur_dir}/{job_name}.err
      cd {cur_dir}
      module load R 2> /dev/null
    partition: "SBATCH --partition=broadwl"
    account: ""
    submit_cmd: sbatch {job_file}
    submit_cmd_output: "Submitted batch job {job_id}"
    status_cmd: squeue --job {job_id}
    kill_cmd: scancel {job_id}
  stephenslab:
    based_on: midway2
    max_cores: 28
    max_mem: 128G
    max_walltime: "10d"
    partition: "SBATCH --partition=mstephens"
    account: "SBATCH --account=pi-mstephens"
default:
  queue: midway2
  instances_per_job: 40
  nodes_per_job: 1
  instances_per_node: 4
  cpus_per_instance: 1
  mem_per_instance: 2G
  time_per_instance: 3m
simulate:
  instances_per_job: 200
score:
  queue: midway2.local
The DSC section is required; it provides templates for a number of systems. Here, midway2 is a host provided by the Research Computing Center at the University of Chicago. Jobs are submitted to partition=broadwl, which typically has 40 cores per node. It is recommended that users do not submit more than 60 jobs at a time, and that the maximum running time per job be kept under 36 hours. These limits are reflected in the max_* values in the configuration.
There is also a “derived” queue: stephenslab is a special partition on midway2 that allows for different configurations, so it is derived from midway2 via based_on: midway2.
The section default is also required. It provides default settings for all modules in the DSC. Available settings are:
- queue: the name of the queue on the remote host to use, one of the queues defined in the DSC section. Here in the template it is set to midway2; the convention <queue>.local means execute locally without submitting to the scheduler (PBS).
- time_per_instance: the maximum computation time for each module instance.
- instances_per_job: the maximum number of module outputs that each job should attempt to generate. This is useful for consolidating numerous light-weight module instances into one job submission. If the DSC contains long-running modules, this number should be made smaller, otherwise each job will take a long time to complete. For modules that run very quickly, instances_per_job should be set to a large number to avoid overloading the job management system on a shared computing environment.
- nodes_per_job: this enables multi-node processing via the MPI interface so that each job can run on multiple nodes to further parallelize module instances.
- instances_per_node: how many parallel module instances are allowed on each node.
- cpus_per_instance: how many CPUs to use for each module instance.
- mem_per_instance: the maximum memory used for each module instance.
For example, for 100 module instances of simulate that each generate some data in under a minute, one can specify time_per_instance: 1m and instances_per_job: 200. Then a single job containing 200 simulations will be submitted to the host, with a total of 200 minutes of computation time reserved.
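In the configuration file this corresponds to a module-specific section that overrides the defaults. The sketch below is illustrative only; it assumes, as the simulate and score entries in the template above suggest, that a module section can override any of the settings defined under default:
simulate:
  time_per_instance: 1m
  instances_per_job: 200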
Typically, the DSC and default sections of the host configuration do not have to be changed for different projects. Users can carefully configure them once and reuse them across projects. Stephens Lab users, for example, can take the example above and configure resources for their own DSC modules (the sections after default).
In our example, under the folder vignettes/one_sample_location you should find a file called midway.yml as described above. You can open it with a text editor (on a cluster, use nano from the terminal, for example), edit it to suit your project, and save it.
Please do not use this template without configuring the required resources, particularly time_per_instance and instances_per_job. They should be tailored for your project. Improper configuration of these settings will lead to running many small jobs (instances_per_job too low) and wasted resources (time_per_instance too high).
Run remote jobs
dsc ... --host /path/to/config.yml
will use the configuration in /path/to/config.yml to submit jobs.
In our example, under the same folder (“/dsc/vignettes/one_sample_location”), we run
dsc first_investigation_simpler.dsc --host midway.yml
Monitor and manage jobs
When jobs are submitted, you should see on the screen something like:
INFO: M1_9056f36343b32614 submitted to midway2 with job id 59250575
So there is a task ID M1_9056f36343b32614 and a job ID 59250575. This can also be observed from your system’s job queue, for example:
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
59250575 broadwl M1_9056f kaiqianz R 0:04 1 midway2-0003
59250576 broadwl M1_10b92 kaiqianz R 0:04 1 midway2-0003
To check on one of the jobs, for example the first one:
sos status M1_9056f -v3
You will see information such as:
M1_9056f36343b32614 56ed1083fa5bd015 normal normal_normal_1
Created 20 hr ago Started 2 min ago Ran for 1 sec
completed
This means that the job comes from the module normal and produces 1 file, normal/normal_1. The output also includes other useful information on execution statistics, such as memory usage.
To see more details, run
sos status M1_9056f -v4
and the output is:
M1_9056f36343b32614 completed
Created 21 hr ago
Started 6 min ago
Ran for 1 sec
TASK:
=====
TAGS:
=====
56ed1083fa5bd015 normal normal_normal_1
ENVIRONMENT:
============
_runtime {'cores': 1,
'cur_dir': '/home/kaiqianz/dsc/vignettes/one_sample_location',
'home_dir': '/home/kaiqianz',
'mem': 2000000000,
'run_mode': 'run',
'sig_mode': 'default',
'verbosity': 2,
'walltime': '00:03:00'}
step_name 'normal'
execution script:
================
#!/bin/bash
#SBATCH --time=00:03:00
#SBATCH --partition=broadwl
#
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=2G
#SBATCH --job-name=M1_9056f36343b32614
#SBATCH --output=/home/kaiqianz/dsc/vignettes/one_sample_location/M1_9056f36343b32614.out
#SBATCH --error=/home/kaiqianz/dsc/vignettes/one_sample_location/M1_9056f36343b32614.err
cd /home/kaiqianz/dsc/vignettes/one_sample_location
module load R
sos execute M1_9056f36343b32614 -v 2 -s default
standout output:
================
9056f36343b32614: completed
output: first_investigation/normal/normal_1.rds
standout error:
================
9056f36343b32614: completed
You will see the actual SBATCH script submitted. Of particular interest are perhaps the .err and .out files, which record the screen output of a job. To check them out:
cat /home/kaiqianz/dsc/vignettes/one_sample_location/M1_9056f36343b32614.err
Note that the current default version of R is 3.5.1.
INFO: M1_9056f36343b32614 started
INFO: All 1 tasks in M1_9056f36343b32614 completed
So there are three lines of messages, related to loading the R module and to task execution, and nothing else. This is perhaps a good sign that no error or warning messages were generated by the R-based jobs.
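If a job needs to be cancelled, you can use the Slurm job ID with the kill command given in the configuration file, or, assuming a standard SoS installation (which provides a sos kill subcommand), the SoS task ID. The IDs below are the ones from the example above:
# cancel via the Slurm job ID (this is what kill_cmd in the configuration does)
scancel 59250575
# or cancel via the SoS task ID
sos kill M1_9056f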
Computer resource overhead
DSC keeps track of jobs submitted to compute nodes from the computer on which you run the dsc command. This is a resource “overhead” one has to pay to run DSC on a cluster. While it is convenient and tempting to run directly from the head node, we recommend submitting the DSC command itself as a job to a compute node, or opening an interactive session on a compute node and running it from there.
The number of CPUs you request for running the DSC submitter command should match the dsc ... -c option. You can check the default setting for -c by running dsc -h and looking for the documentation on -c (the default depends on the machine you run it from). We recommend requesting 4 CPUs on a compute node, each with 2GB of memory allocated, and then submitting with -c 4.