# Slurm GPU Job Scheduler

## Introduction
SLURM (Simple Linux Utility for Resource Management) is a powerful, flexible workload manager and job scheduler. It allocates resources and lets you submit, monitor, and manage jobs on high-performance computing (HPC) clusters.
This guide covers the basics of using SLURM, including submitting jobs, requesting resources, and monitoring their execution.
## Table of Contents
- Submitting a Job with SLURM
- Basic SLURM Directives
- Example SLURM Job Script
- Running Array Jobs
- Monitoring Jobs
- Canceling Jobs
- Common SLURM Commands
## Submitting a Job with SLURM
To submit a job in SLURM, you create a job script containing directives that tell SLURM what resources your job needs, how long it will run, where to write output, and so on. The script is submitted with the `sbatch` command:

```bash
sbatch my_job_script.slurm
```
## Basic SLURM Directives
In the job script, directives are lines that start with the `#SBATCH` prefix, followed by the resource requests or configuration options your job needs.
Here are some common SLURM directives:
| Directive | Description |
|---|---|
| `--job-name=<name>` | Sets the job name for easier identification |
| `--output=<file>` | File for standard output (use `%j` for the job ID) |
| `--error=<file>` | File for standard error (use `%j` for the job ID) |
| `--ntasks=<num>` | Number of tasks (processes) to run; each task gets one CPU core by default |
| `--mem=<size>` | Memory required per node (e.g., `4G`, `10G`) |
| `--time=<time>` | Maximum run time (format: `days-hours:minutes:seconds`) |
| `--partition=<name>` | Partition (queue) to submit the job to |
| `--gpus=<num>` | Number of GPUs required |
| `--array=<range>` | Job array range (e.g., `0-10` creates 11 tasks) |
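Any of these directives can also be passed to `sbatch` on the command line, where they override the values written in the script. For example:

```bash
# One-off override of the script's time limit and GPU count
sbatch --time=0-02:00:00 --gpus=2 my_job_script.slurm
```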
## Example SLURM Job Script
Below is a simple example of a SLURM job script that requests one GPU and runs a Python program (the job, file, and script names are placeholders):

```bash
#!/bin/bash
#SBATCH --job-name=gpu_job
#SBATCH --output=gpu_job_%j.out
#SBATCH --error=gpu_job_%j.err
#SBATCH --ntasks=1
#SBATCH --mem=4G
#SBATCH --time=0-01:00:00
#SBATCH --gpus=1

srun python my_script.py
```
In this script:
- The `#SBATCH` directives configure the job's resources.
- The `srun` command launches the program; in this case it runs a Python script.
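Once the job is submitted, its standard output and error are written to the files named by `--output` and `--error`, with `%j` replaced by the numeric job ID. For example, assuming the script above is saved as `my_job_script.slurm`:

```bash
sbatch my_job_script.slurm     # prints: Submitted batch job <job_id>
tail -f gpu_job_<job_id>.out   # follow the job's standard output as it runs
```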
## Running Array Jobs
Array jobs let you submit many similar jobs with a single submission. You specify the range of task IDs with the `--array` directive:

```bash
#SBATCH --array=0-10
```
In your script, the environment variable `$SLURM_ARRAY_TASK_ID` identifies which task of the array is running, so each task can operate on different input.

Example (`process.py` stands in for your own program):

```bash
#!/bin/bash
#SBATCH --job-name=array_job
#SBATCH --output=array_job_%A_%a.out   # %A = array job ID, %a = task ID
#SBATCH --array=0-10

# Pass the task ID to the program so each task handles a different input
srun python process.py $SLURM_ARRAY_TASK_ID
```
## Monitoring Jobs
To monitor your submitted jobs, you can use the following commands:
- `squeue`: shows the status of jobs in the queue; add `-u` to list only your own jobs.

  ```bash
  squeue -u <username>
  ```

- `scontrol show job <job_id>`: shows detailed information about a specific job.

- `sacct`: displays accounting information for your completed jobs.

  ```bash
  sacct -j <job_id>
  ```
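`sacct` relies on SLURM's accounting database; if accounting is enabled on your cluster, you can request specific columns with `--format`. A minimal sketch:

```bash
# Request selected columns: job ID, name, final state, wall time, peak memory
sacct -j <job_id> --format=JobID,JobName,State,Elapsed,MaxRSS
```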
## Canceling Jobs
You can cancel a running or pending job using the `scancel` command:

```bash
scancel <job_id>
```
To cancel an entire job array, omit the task ID; to cancel only one task, append its task ID to the job ID:

```bash
scancel <job_id>             # cancels the entire array
scancel <job_id>_<task_id>   # cancels a single array task
```
## Common SLURM Commands
- `sbatch my_job_script.slurm`: submits a job script.
- `squeue -u <username>`: displays information about your jobs in the queue.
- `scancel <job_id>`: cancels a job or set of jobs.
- `sinfo`: shows the status of partitions and nodes.
- `scontrol show job <job_id>`: manages and inspects jobs and resources (here, showing details for one job).
- `srun`: runs parallel tasks within a SLURM job (not typically needed for single-task jobs).
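Put together, a typical submit-monitor-cancel cycle looks like this (the job ID `12345` is illustrative):

```bash
sbatch my_job_script.slurm   # prints: Submitted batch job 12345
squeue -u $USER              # watch the job's state in the queue
scontrol show job 12345      # inspect the allocated resources
scancel 12345                # cancel the job if something looks wrong
```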
## Additional Guides

For further details and advanced usage, consult the official SLURM documentation at https://slurm.schedmd.com/.