# HTCondor Guide
This guide provides instructions for using HTCondor, including cluster setup, job submission, troubleshooting, and useful commands.
## Cluster Setup
### Pool 1 - Research
- Central manager node: pllimskhpc1
- Submit and Execute nodes: pllimskhpc1, pllimskhpc2, pllimskhpc3
### Pool 2 - Engineering
- Central manager node: pllimsksparky1
- Submit and Execute nodes: pllimsksparky1, pllimsksparky2, pllimsksparky3, pllimsksparky4
## Example Submit File
Below is an example `.sub` file for submitting a training job with HTCondor.
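The file below is a minimal sketch; the executable name, paths, and resource requests are placeholders to adapt for your own job:

```
# Sketch of a training-job submit file; all values are illustrative.
universe                = vanilla
executable              = train.sh
arguments               = --epochs 10

output                  = logs/train.$(Cluster).$(Process).out
error                   = logs/train.$(Cluster).$(Process).err
log                     = logs/train.$(Cluster).log

request_cpus            = 4
request_gpus            = 1
request_memory          = 16GB
request_disk            = 10GB

should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
requirements            = TARGET.UidDomain == "mskcc.org" && \
                          TARGET.FileSystemDomain == "mskcc.org"

queue 1
```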
## Example Workflow

- Prepare an executable (e.g., shell script, Python script, or Docker container) and a `.sub` file (a minimal wrapper script is sketched after this list).
- Submit your job:

    ```bash
    condor_submit submit.sub
    ```

- Monitor job status:

    ```bash
    condor_q
    ```

- View detailed job analysis:

    ```bash
    condor_q -better-analyze <job_id>
    ```

- Check output and logs (specified in the `.sub` file).
- Terminate a job:

    ```bash
    condor_rm <job_id>
    ```
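As a sketch, a minimal wrapper executable (a hypothetical `train.sh`) only needs a shebang line followed by the commands to run:

```bash
#!/bin/bash
# Hypothetical wrapper script (train.sh); replace the body with your own commands.
set -euo pipefail

echo "Running on $(hostname)"
python3 train.py "$@"   # arguments are passed through from the submit file
```

Make the script executable (`chmod +x train.sh`); a missing or wrong shebang line is the usual cause of the "Exec format error" listed under Troubleshooting.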
## Common HTCondor Commands

### Check Nodes
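Typical commands for inspecting the nodes in a pool are sketched below; `<node_name>` is a placeholder hostname:

```bash
condor_status                      # list all slots in the pool
condor_status -compact             # one summary line per machine
condor_status -avail               # show only slots available to run jobs
condor_status -long <node_name>    # print the full ClassAd for a specific node
```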
### Check Jobs
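Typical commands for inspecting the job queue, as a sketch:

```bash
condor_q                        # your jobs currently in the queue
condor_q -allusers -nobatch     # all users' jobs, one line per job
condor_history <job_id>         # jobs that have already left the queue
```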
### Debugging Jobs
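For jobs that stay idle or fail to match, the analysis option already mentioned in the workflow is the usual starting point:

```bash
condor_q -better-analyze <job_id>    # explain why a job is not matching or not starting
```

Where the pool allows it, `condor_ssh_to_job <job_id>` opens a shell inside a running job for interactive inspection.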
## Job Submission Notes

- To prevent overloading the cluster, set the following properties in your submit file (see the sketch after this list):
    - `max_idle`: Maximum number of idle jobs allowed in the queue at any one time.
    - `max_materialize`: Maximum number of jobs materialized into the queue at any one time.
- To ensure compatibility with shared filesystems:

    ```
    Requirements = TARGET.UidDomain == "mskcc.org" && \
                   TARGET.FileSystemDomain == "mskcc.org"
    ```
- To exclude specific nodes from matching your jobs:

    ```
    Requirements = (Machine != "pllimsksparky2.mskcc.org")
    ```
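As an illustrative sketch of the throttling settings above (the values are placeholders to tune for your workload):

```
max_materialize = 100    # at most 100 jobs materialized into the queue at once
max_idle        = 20     # pause materialization while 20 jobs are idle
queue 1000
```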
## Troubleshooting

- Exec format error: Ensure the script starts with the correct shebang (e.g., `#!/bin/bash` for Bash scripts).
- MemoryError: Increase the memory allocation in the submit file:

    ```
    request_memory = 16GB
    ```

- Job stuck in hold: List the job's attributes and check the `HoldReason` attribute for details:

    ```bash
    condor_q -l <job_id>
    ```
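Where only the hold reason is needed, the autoformat option can print that single attribute; a sketch:

```bash
condor_q -af HoldReason <job_id>    # print just the HoldReason attribute
```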
## Useful Guides
Caution: Always test your submit file and program with a small number of jobs before scaling up, to prevent resource waste and low submission priority. Use `condor_submit -dry-run` for debugging.
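As a usage sketch, `-dry-run` writes the job ClassAds that would be generated to a file (or to stdout when the file is `-`) without queueing anything:

```bash
condor_submit -dry-run - submit.sub    # show what would be submitted; nothing is queued
```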