TORQUE Resource Manager
TORQUE is a resource management system for submitting and controlling batch jobs on supercomputers.
Batch jobs are typically submitted using a batch script, which is a text file containing a number of job directives and GNU/Linux commands or utilities. Batch scripts are submitted to the batch system (TORQUE), where they are queued awaiting free resources. The batch system has two components on Atlas:
TORQUE, as the resource manager, is in charge of:

- Accepting and starting jobs/tasks across a batch farm (`qsub`).
- Cancelling jobs (`qdel`).
- Monitoring jobs (`qstat`).
- Accounting.
MAUI, as the scheduler, is in charge of deciding when and where jobs run.
TORQUE includes numerous directives, which are used to specify resource requirements and other attributes for batch and interactive jobs. TORQUE directives can appear as header lines (lines that start with `#PBS`) in a batch job script or as command-line options to the `qsub` command.
Here is an example of a generic batch script:
```bash
#!/bin/bash
#PBS -q parallel
#PBS -l nodes=1:ppn=24
#PBS -l mem=100gb
#PBS -l cput=1000:00:00
#PBS -N JOB_NAME

echo 'Hello DIPC!'
```
This batch script example can be read line by line as follows:

- `#!/bin/bash`: run the job under the BASH shell.
- `#PBS -q parallel`: send the job to the parallel queue.
- `#PBS -l nodes=1:ppn=24`: make a reservation of 1 node and 24 cores per node.
- `#PBS -l mem=100gb`: make a reservation of 100 GB of RAM.
- `#PBS -l cput=1000:00:00`: make a reservation of 1000 hours of CPU time.
- `#PBS -N JOB_NAME`: give a name to the job.
- `echo 'Hello DIPC!'`: the actual piece of code we want to run.
Hint
If you do not know the amount of resources your jobs are going to need in terms of memory or time, overestimate these values in the first runs and then tune them as you learn how your jobs behave.
Once your batch script is prepared, you can submit it as a batch job to the queue system.
Hint
Every time you open a shell session you will land in your home directory. On the computing nodes this happens to be `/scratch/username`. Therefore, if your input files are located elsewhere, take this into consideration before running your application in the batch script; that is, you may need to `cd` into the directory where your input files are stored. TORQUE creates a special variable named `$PBS_O_WORKDIR` that points to the directory from which the job was submitted. Make use of it to keep your scripts as general and reusable as possible.
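For example, a minimal sketch of a script that relies on `$PBS_O_WORKDIR` (the executable and input file names are placeholders):

```bash
#!/bin/bash
#PBS -q parallel
#PBS -l nodes=1:ppn=1
#PBS -N workdir_example

# Jobs start in the home directory; move to the directory the job
# was submitted from, so relative paths to input files resolve.
cd $PBS_O_WORKDIR

# Hypothetical executable and input file, for illustration only.
./my_program input.dat
```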
Hint
If you do not redirect the output of your runs to a specific file, the standard output will be redirected to a file named `YOUR_JOBS_NAME.oXXXXXXX`, where `YOUR_JOBS_NAME` is the name you gave your job in the batch script and `XXXXXXX` is the job identifier.
Similarly, the error output gets redirected to a file named `YOUR_JOBS_NAME.eXXXXXXX`.
Error and standard output files are created within `$PBS_O_WORKDIR` by default, as long as it is not specified otherwise.
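If you want explicit file names instead, the standard PBS `-o` and `-e` directives let you choose where standard output and standard error are written; a short sketch (the paths are placeholders):

```bash
#PBS -N my_job
# Write standard output and standard error to explicit locations
# instead of the default my_job.oXXXXXXX / my_job.eXXXXXXX files.
#PBS -o /scratch/username/my_job.out
#PBS -e /scratch/username/my_job.err
```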
You can find examples of batch scripts tailored to a particular HPC system on that system's dedicated webpage.
How to submit jobs
You will typically want to submit your jobs from a directory located under your `/scratch` directory. That is why, before anything else, you will need to move or copy all your necessary files to this filesystem.
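For instance, a sketch of staging a project from your home directory before submitting (the project name and paths are placeholders):

```bash
# Copy the input files to the scratch filesystem,
# then submit the job from there.
cp -r ~/myproject /scratch/username/myproject
cd /scratch/username/myproject
qsub batch_script.pbs
```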
qsub
To submit your job script (for example, `batch_script.pbs`), use the `qsub` command:

```
qsub batch_script.pbs
2345456.maui01
```
As a result of invoking the command, TORQUE will return a job tag or identifier.
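Since the identifier is printed on standard output, you can capture it in a shell variable and reuse it with other TORQUE commands; a small sketch:

```bash
# Capture the job identifier returned by qsub.
JOBID=$(qsub batch_script.pbs)
echo "Submitted job $JOBID"

# Reuse it, for example, to query the job's full status.
qstat -f $JOBID
```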
Monitoring your jobs
qstat
The `qstat` command shows the status of the queue and jobs. Adding some options enriches the output to display even more information:
```
qstat -a

maui01: DIPC ATLAS cluster
                                                                                  Req'd  Req'd      Elap
Job ID                  Username    Queue    Jobname          SessID  NDS   TSK  Memory Time      S Time
----------------------- ----------- -------- ---------------- ------ ----- ------ ------ --------- - ---------
321742.maui01           user1       p-slow-l job1              28613     1     24   30gb 10000:00: R 3895:26:0
375137.maui01           user2       p-slow-l job1              15691     1     24   40gb 1000000:0 R 4674:41:4
376643.maui01           user1       p-slow-l job2             164095     1     12   50gb 10000:00: R 2159:42:2
376646.maui01           user3       p-slow-l superjob          22318     1     12   50gb 10000:00: R 2131:58:2
```
Useful `qstat` options include:

| Option | Description |
|--------|-------------|
| `-u username` | Display jobs for a particular user |
| `-a` | Display all jobs |
| `-f job_id` | Display detailed information of one particular job |
| `-n` | Display jobs and computing nodes |
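For example, to list only the jobs of a given user or inspect one of the jobs shown above:

```bash
# Show jobs belonging to a particular user.
qstat -u user1

# Show the full details of a single job.
qstat -f 321742.maui01
```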
showq
`showq` shows the status of the queue. It provides useful information such as the remaining time of each job:
```
showq

ACTIVE JOBS--------------------
JOBNAME            USERNAME      STATE  PROC   REMAINING            STARTTIME

2244574            user1       Running    24     1:39:13  Tue Feb  5 09:42:17
2344576            user2       Running   256    22:29:11  Tue Feb  5 19:21:17

   568 Active Jobs    5110 of 5512 Processors Active (92.71%)
                       188 of  190 Nodes Active      (98.95%)

IDLE JOBS----------------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME

2340047            user2          Idle    32  2:14:30:00  Fri Feb  1 11:44:03
2341327            user2          Idle    24 66:16:00:00  Sat Feb  2 08:53:00

2 Idle Jobs

BLOCKED JOBS----------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME

2310628            user1          Idle    24 17:08:40:00  Sat Jan 26 14:41:25
```
Useful `showq` options include:

| Option | Description |
|--------|-------------|
| `-u username` | Display jobs for a particular user |
tracejob
The `tracejob` utility extracts job status and job events from the accounting records. Using it can help identify where, how, and why a job failed. This tool takes a job identifier as a parameter:
```
tracejob -n X job_id
```

where `X` stands for the number of days to look back in the accounting records, which should be at least the number of days elapsed since the job finished. For example:

```
tracejob -n 10 130949
```
Deleting jobs
qdel
`qdel` deletes a job. It takes the job identifier as a parameter:

```
qdel 144833
```

To delete all your jobs:

```
qdel all
```
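If you need to delete only a subset of your jobs, one common approach (a sketch, not a built-in TORQUE feature; the state column position matches the `qstat -a` output shown earlier) is to filter the output of `qstat` and pass the matching identifiers to `qdel`:

```bash
# Delete all of your jobs that are still queued (state Q):
# pick the lines whose state column reads Q and extract the job id.
qstat -u username | awk '$10 == "Q" {print $1}' | xargs qdel
```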
Some useful PBS/OS Environment Variables
You can use any of the following environment variables in your batch scripts:

| Variable | Description |
|----------|-------------|
| `PBS_O_WORKDIR` | Name of the directory from which the user submitted the job |
| `PBS_NODEFILE` | Contains a list of the nodes assigned to the job |
| `PBS_JOBID` | Job's PBS identifier |
| `USER` | Contains the username of the user that submitted the job |
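A sketch of how these variables are typically used together inside a batch script (the log file name is just illustrative):

```bash
#!/bin/bash
#PBS -q parallel
#PBS -l nodes=2:ppn=24
#PBS -N env_example

cd $PBS_O_WORKDIR

# PBS_NODEFILE lists one line per allocated core,
# so counting its lines gives the total number of processors.
NPROCS=$(wc -l < $PBS_NODEFILE)

# Tag output with the job identifier so different runs do not collide.
echo "Job $PBS_JOBID for $USER running on $NPROCS cores" > job_${PBS_JOBID}.log
```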