TORQUE Resource Manager

TORQUE is a resource management system for submitting and controlling batch jobs on supercomputers.

Batch jobs are typically submitted using a batch script, which is a text file containing a number of job directives and GNU/Linux commands or utilities. Batch scripts are submitted to the batch system (TORQUE), where they are queued until the requested resources become free. The batch system has two components on Atlas:

TORQUE, as the resource manager, is in charge of:

  • Accepting and starting jobs/tasks across a batch farm (qsub).
  • Cancelling jobs (qdel).
  • Monitoring jobs (qstat).
  • Accounting.

MAUI as the scheduler is in charge of scheduling the jobs.

TORQUE includes numerous directives, which are used to specify resource requirements and other attributes for batch and interactive jobs. TORQUE directives can appear as header lines (lines that start with #PBS) in a batch job script or as command-line options to the qsub command.

Here is an example of a generic batch script:

#!/bin/bash
    #PBS -q parallel
    #PBS -l nodes=1:ppn=24
    #PBS -l mem=100gb
    #PBS -l cput=1000:00:00
    #PBS -N JOB_NAME
    
    echo 'Hello DIPC!'
    

This batch script example can be read line by line as follows:

  • #!/bin/bash: run the job under the shell BASH.
  • #PBS -q parallel: send to the parallel queue.
  • #PBS -l nodes=1:ppn=24: make a reservation of 1 node and 24 cores per node.
  • #PBS -l mem=100gb: make a reservation of 100 GB of RAM.
  • #PBS -l cput=1000:00:00: make a reservation of 1000 hours of CPU time.
  • #PBS -N JOB_NAME: give a name to the job.
  • echo 'Hello DIPC!': actual piece of code we want to run.
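
Since directives can also be given as command-line options to qsub, the header lines of the example can be moved to the command line. The following is a minimal sketch, assuming the script is saved as batch_script.pbs (a placeholder name); it only prints the command when qsub is not installed, so it can be tried safely on any machine.

```shell
#!/bin/bash
# Command-line equivalent of the #PBS header lines in the example script.
# batch_script.pbs is a placeholder name for your own script.
submit_cmd="qsub -q parallel -l nodes=1:ppn=24 -l mem=100gb -l cput=1000:00:00 -N JOB_NAME batch_script.pbs"

if command -v qsub >/dev/null 2>&1; then
    $submit_cmd                     # on a TORQUE system: submit the job
else
    echo "dry run: $submit_cmd"     # elsewhere: just show the command
fi
```

Keeping the directives in the script makes a job easier to reproduce; the command line is convenient for one-off changes such as a different queue.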

Hint

If you do not know how much memory or time your jobs are going to need, overestimate these values in the first runs and adjust them as you learn how your jobs behave.

Once your batch script is ready, you can submit it as a batch job to the queue system.

Hint

Every time you open a shell session you land in your home directory. On the computing nodes this happens to be /scratch/username. Therefore, if your input files are located elsewhere, take this into account before running your application in the batch script: you may need to cd into the directory where your input files are stored. TORQUE creates a special variable named $PBS_O_WORKDIR that points to the directory from which the job was submitted. Use it to keep your scripts as general and reusable as possible.
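
A minimal sketch of this pattern is shown below. The job name workdir_demo is a placeholder, and the fallback to the current directory is an addition that lets the same script run outside the batch system for testing.

```shell
#!/bin/bash
#PBS -N workdir_demo

# PBS_O_WORKDIR is set by TORQUE inside a job. Outside the batch system it is
# unset, so fall back to the current directory (assumption for local testing).
workdir="${PBS_O_WORKDIR:-$PWD}"
cd "$workdir" || exit 1

echo "Working directory: $(pwd)"
# Commands that read input files relative to the submission directory go here.
```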

Hint

If you do not redirect the output of your runs to a specific file, the standard output will be written to a file named YOUR_JOBS_NAME.oXXXXXXX, where YOUR_JOBS_NAME is the name you gave your job in the batch script and XXXXXXX is the job identifier.

Similarly, the error output is written to a file named YOUR_JOBS_NAME.eXXXXXXX.

Unless specified otherwise, the error and standard output files are created in $PBS_O_WORKDIR.
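
The naming rule can be sketched as follows. The job name and identifier are sample values, and the sketch assumes that only the numeric part of the identifier (before the first dot) appears in the file names.

```shell
#!/bin/bash
job_name="JOB_NAME"          # value given with #PBS -N (sample)
job_id="2345456.maui01"      # sample identifier of the form returned by qsub

# Assumption: the file names use the numeric part before the first dot.
seq_num="${job_id%%.*}"
stdout_file="${job_name}.o${seq_num}"
stderr_file="${job_name}.e${seq_num}"

echo "stdout: $stdout_file"
echo "stderr: $stderr_file"
```

To change the defaults, the directives #PBS -o path and #PBS -e path set the two files explicitly, and #PBS -j oe merges the error output into the standard output file.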

You can find examples of batch scripts tailored for a particular HPC system on its dedicated web page.

How to submit jobs

You will typically want to submit your jobs from a directory located under your /scratch directory. Therefore, before anything else, move or copy all the files you need to this filesystem.

qsub

To submit your job script (for example, batch_script.pbs), use the qsub command.

qsub batch_script.pbs
    
    2345456.maui01
    

As a result of invoking the command, TORQUE will return a job tag or identifier.
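
Because qsub prints the identifier on standard output, it can be captured in a variable for later use with qstat, qdel or tracejob. This is a sketch rather than a fixed recipe: when qsub is not available it falls back to the sample identifier shown above, so it can be tried on any machine.

```shell
#!/bin/bash
# batch_script.pbs is the example script name used in the text.
if command -v qsub >/dev/null 2>&1; then
    job_id=$(qsub batch_script.pbs) || exit 1
else
    job_id="2345456.maui01"   # sample identifier from the text (dry run)
fi

# Strip the server suffix to keep only the numeric part of the identifier.
job_num="${job_id%%.*}"
echo "Submitted job $job_num (full identifier: $job_id)"
```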

Monitoring your jobs

qstat

The qstat command shows the status of the queue and its jobs. Adding options enriches the output with more information:

qstat -a
    
    maui01: DIPC ATLAS cluster
                                                                                      Req'd    Req'd       Elap
    Job ID                  Username    Queue    Jobname          SessID  NDS   TSK   Memory   Time    S   Time
    ----------------------- ----------- -------- ---------------- ------ ----- ------ ------ --------- - ---------
    321742.maui01           user1       p-slow-l job1             28613     1     24   30gb 10000:00:  R 3895:26:0
    375137.maui01           user2       p-slow-l job1             15691     1     24   40gb 1000000:0  R 4674:41:4
    376643.maui01           user1       p-slow-l job2            164095     1     12   50gb 10000:00:  R 2159:42:2
    376646.maui01           user3       p-slow-l superjob         22318     1     12   50gb 10000:00:  R 2131:58:2
    

Useful qstat options include:

Option        Description
-u username   Display jobs for a particular user
-a            Display all jobs
-f job_id     Display detailed information of one particular job
-n            Display jobs and computing nodes
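
The qstat output can be summarised with standard text tools. The sketch below counts running jobs per user with awk, using the sample rows from the qstat -a output above as input. It assumes the column layout shown there (user in column 2, state in column 10); the helper name qstat_sample is hypothetical.

```shell
#!/bin/bash
# Sample rows copied from the qstat -a output above. On a TORQUE system,
# replace the call to qstat_sample with: qstat -a
qstat_sample() {
cat <<'EOF'
321742.maui01           user1       p-slow-l job1             28613     1     24   30gb 10000:00:  R 3895:26:0
375137.maui01           user2       p-slow-l job1             15691     1     24   40gb 1000000:0  R 4674:41:4
376643.maui01           user1       p-slow-l job2            164095     1     12   50gb 10000:00:  R 2159:42:2
376646.maui01           user3       p-slow-l superjob         22318     1     12   50gb 10000:00:  R 2131:58:2
EOF
}

# Job rows start with a numeric job ID; keep only those in state R.
running_per_user=$(qstat_sample | awk '
    $1 ~ /^[0-9]+\./ && $10 == "R" { n[$2]++ }
    END { for (u in n) print u, n[u] }' | sort)

echo "$running_per_user"
```

For the sample rows this prints one line per user: user1 2, user2 1 and user3 1.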

showq

showq shows the status of the queue. It provides useful information such as the remaining time of each running job.

showq
    
    ACTIVE JOBS--------------------
    JOBNAME            USERNAME      STATE  PROC   REMAINING            STARTTIME
    
    2244574            user1       Running    24     1:39:13  Tue Feb  5 09:42:17
    2344576            user2       Running   256    22:29:11  Tue Feb  5 19:21:17
    
       568 Active Jobs    5110 of 5512 Processors Active (92.71%)
                           188 of  190 Nodes Active      (98.95%)
    
    IDLE JOBS----------------------
    JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME
    
    2340047               user2       Idle    32  2:14:30:00  Fri Feb  1 11:44:03
    2341327               user2       Idle    24 66:16:00:00  Sat Feb  2 08:53:00
    
    2 Idle Jobs
    
    BLOCKED JOBS----------------
    JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME
    
    2310628               user1       Idle    24 17:08:40:00  Sat Jan 26 14:41:25
    

Useful showq options include:

Option        Description
-u username   Display jobs for a particular user
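
The summary lines of showq give a quick view of how busy the cluster is. As an illustration, this sketch extracts the number of free processors from the "Processors Active" line, using the line from the sample output above and assuming the line keeps that wording.

```shell
#!/bin/bash
# Summary line copied from the showq output above. On the cluster, the same
# awk program can read the real output instead: showq | awk '...'
summary='   568 Active Jobs    5110 of 5512 Processors Active (92.71%)'

# Find the word "of" and subtract the number before it from the number after it.
free_procs=$(printf '%s\n' "$summary" | awk '/Processors Active/ {
    for (i = 2; i < NF; i++)
        if ($i == "of") { print $(i + 1) - $(i - 1); exit }
}')

echo "Free processors: $free_procs"
```

With the sample line, 5512 - 5110 gives 402 free processors.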

tracejob

The tracejob utility extracts job status and job events from accounting records. It can help identify where, how, and why a job failed. This tool takes a job ID as a parameter.

tracejob -n X job_id
    

where X is the number of days of records to search back, so it must be at least the number of days elapsed since the job finished. For example:

tracejob -n 10 130949
    

Deleting jobs

qdel

The qdel command deletes a job. It takes the job identifier as a parameter.

qdel 144833
    

To delete all your jobs:

qdel all
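
# A more selective variant (a sketch, assuming the TORQUE qselect command is
# available on the system): delete only your jobs that are still queued
# (state Q) and leave the running ones untouched.

```shell
#!/bin/bash
user="${USER:-$(id -un)}"

if command -v qselect >/dev/null 2>&1; then
    # qselect prints the identifiers of the jobs matching the filters.
    queued=$(qselect -u "$user" -s Q)
    if [ -n "$queued" ]; then
        qdel $queued      # unquoted on purpose: one argument per job ID
    fi
else
    echo "dry run: qselect -u $user -s Q | xargs qdel"
fi
```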
    

Some useful PBS/OS Environment Variables

You can use any of the following environment variables in your batch scripts:

Variable        Description
PBS_O_WORKDIR   Name of the directory from which the user submitted the job.
PBS_NODEFILE    Path to the file that contains the list of nodes assigned to the job.
PBS_JOBID       Job's PBS identifier.
USER            Username of the user that submitted the job.
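
The sketch below shows how these variables might be combined in a batch script. The job name env_demo and the fallback values are placeholders added so that the script also runs outside the batch system; inside a job TORQUE sets the real values. It assumes the node file contains one line per allocated core.

```shell
#!/bin/bash
#PBS -N env_demo
#PBS -l nodes=1:ppn=24

# Fallbacks (assumptions for local testing) when the script runs outside TORQUE.
job_id="${PBS_JOBID:-0.localhost}"
user="${USER:-unknown}"
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    nodefile="$PBS_NODEFILE"
else
    nodefile=$(mktemp)
    echo "localhost" > "$nodefile"
fi

# Count allocated slots (lines) and distinct nodes (unique host names).
n_slots=$(wc -l < "$nodefile" | tr -d ' ')
n_nodes=$(sort -u "$nodefile" | wc -l | tr -d ' ')

cd "${PBS_O_WORKDIR:-$PWD}" || exit 1
echo "Job $job_id of user $user: $n_slots slot(s) on $n_nodes node(s), running in $(pwd)"
```

The slot count is a convenient value to pass to a parallel launcher instead of hard-coding the number of cores in the script.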