Slurm GPU Job Scheduler¶
Introduction¶
SLURM (Simple Linux Utility for Resource Management) is a powerful and flexible workload manager and job scheduler. It is used to allocate resources and to submit, monitor, and manage jobs on high-performance computing clusters.
This guide covers the basics of using SLURM, including submitting jobs, requesting resources, and monitoring their execution.
Table of Contents¶
- Submitting a Job with SLURM
- Basic SLURM Directives
- Example SLURM Job Script
- Running Array Jobs
- Monitoring Jobs
- Canceling Jobs
- Common SLURM Commands
Submitting a Job with SLURM¶
To submit a job in SLURM, you create a job script containing directives that tell SLURM what resources your job needs, how long it will run, where to write output, and so on. The script is then submitted with the `sbatch` command:

```bash
sbatch my_job_script.slurm
```
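If the job is accepted, `sbatch` prints the assigned job ID (for example, `Submitted batch job 123456`); you will use this ID with the monitoring and cancellation commands described below.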
Basic SLURM Directives¶
In the job script, directives are defined using the `#SBATCH` prefix, followed by the resource requests or configuration options you need for your job.

Here are some common SLURM directives:
| Directive | Description |
|---|---|
| `--job-name=<name>` | Sets the job name for easier identification |
| `--output=<file>` | File to store standard output (use `%j` for the job ID) |
| `--error=<file>` | File to store standard error (use `%j` for the job ID) |
| `--ntasks=<num>` | Number of tasks (CPU cores) required |
| `--mem=<size>` | Memory required for the job (e.g., `4G`, `10G`) |
| `--time=<time>` | Maximum run time (format: `days-hours:minutes:seconds`) |
| `--partition=<name>` | Specify the partition or queue to use |
| `--gpus=<num>` | Number of GPUs required |
| `--array=<range>` | Job array (e.g., `0-10` creates 11 tasks) |
Example SLURM Job Script¶
Below is a simple example of a SLURM job script that requests one GPU and runs a Python program. Treat it as a starting point: the partition name, resource amounts, and file names are placeholders to adapt to your cluster.
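```bash
#!/bin/bash
#SBATCH --job-name=gpu_job             # Job name shown in the queue
#SBATCH --output=output_%j.log         # Standard output (%j expands to the job ID)
#SBATCH --error=error_%j.log           # Standard error
#SBATCH --ntasks=1                     # One task (CPU core)
#SBATCH --mem=8G                       # Memory for the job
#SBATCH --time=0-01:00:00              # Maximum run time: 1 hour
#SBATCH --partition=gpu                # Placeholder partition name
#SBATCH --gpus=1                       # One GPU

srun python my_script.py               # Launch the program
```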
In this script:

- The `#SBATCH` directives configure the job's resources.
- The `srun` command launches the program, which in this case runs a Python script.
Running Array Jobs¶
Array jobs allow you to submit multiple similar jobs with a single submission. You specify the range of task indices with the `--array` directive:

```bash
#SBATCH --array=0-10
```
In your script, you can use the environment variable `$SLURM_ARRAY_TASK_ID` to differentiate tasks in the array.

Example:
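A minimal sketch, assuming input files named `input_<task id>.txt` in the working directory and a script that accepts an `--input` argument:

```bash
#!/bin/bash
#SBATCH --job-name=array_job
#SBATCH --output=array_%A_%a.log       # %A = array job ID, %a = array task index
#SBATCH --array=0-10                   # 11 tasks, indices 0 through 10
#SBATCH --ntasks=1
#SBATCH --mem=4G
#SBATCH --time=0-00:30:00

# Each task picks its own input file based on the array index.
srun python my_script.py --input "input_${SLURM_ARRAY_TASK_ID}.txt"
```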
Monitoring Jobs¶
To monitor your submitted jobs, you can use the following commands:

- `squeue`: Shows the status of all jobs in the queue.

  ```bash
  squeue -u <username>
  ```

- `scontrol show job <job_id>`: Shows detailed information about a specific job.

- `sacct`: Displays accounting information for your completed jobs.

  ```bash
  sacct -j <job_id>
  ```
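For finished jobs, `sacct` can also report resource usage via its `--format` option; the field selection below is just one example, and the fields available depend on your cluster's accounting configuration:

```bash
sacct -j <job_id> --format=JobID,JobName,State,Elapsed,MaxRSS
```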
Canceling Jobs¶
You can cancel a running or pending job using the `scancel` command:

```bash
scancel <job_id>
```
To cancel an entire job array, omit the task ID; to cancel only one task, append its task ID:

```bash
scancel <job_id>
scancel <job_id>_<task_id>
```
Common SLURM Commands¶
- `sbatch`: Submits a job script.

  ```bash
  sbatch my_job_script.slurm
  ```

- `squeue`: Displays information about jobs in the queue.

  ```bash
  squeue -u <username>
  ```

- `scancel`: Cancels a job or set of jobs.

  ```bash
  scancel <job_id>
  ```

- `sinfo`: Shows the status of partitions and nodes.

  ```bash
  sinfo
  ```

- `scontrol`: Allows you to manage jobs and resources (e.g., show job details).

  ```bash
  scontrol show job <job_id>
  ```

- `srun`: Runs parallel tasks within a SLURM job (not typically needed for single-node jobs).
Additional Guides¶
For further details and advanced usage, consult the official SLURM documentation.