# HTCondor Guide
This guide provides instructions for using HTCondor, including cluster setup, job submission, troubleshooting, and useful commands.
## Cluster Setup

### Pool 1 - Research

- Central manager node: `pllimskhpc1`
- Submit and Execute nodes: `pllimskhpc1`, `pllimskhpc2`, `pllimskhpc3`

### Pool 2 - Engineering

- Central manager node: `pllimsksparky1`
- Submit and Execute nodes: `pllimsksparky1`, `pllimsksparky2`, `pllimsksparky3`, `pllimsksparky4`
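From a machine outside a given pool, `condor_status` can be pointed at that pool's collector through its central manager (a sketch; assumes the default collector port):

```
condor_status -pool pllimskhpc1      # query the Research pool
condor_status -pool pllimsksparky1   # query the Engineering pool
```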
## Example Submit File

Below is an example `.sub` file for submitting a training job with HTCondor:
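A minimal sketch of such a file; the executable name, log paths, and resource values below are placeholders, not the actual values used on these clusters:

```
# train.sub - example HTCondor submit file (all names are placeholders)
universe      = vanilla
executable    = train.sh
arguments     = --epochs 10

# Per-process output, error, and event log
output        = logs/train.$(Cluster).$(Process).out
error         = logs/train.$(Cluster).$(Process).err
log           = logs/train.$(Cluster).log

# Resource requests
request_cpus   = 4
request_memory = 8GB
request_disk   = 10GB

# Transfer input files to the execute node if needed
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT

queue 1
```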
## Example Workflow

- Prepare an executable (e.g., a shell script, Python script, or Docker container) and a `.sub` file.
- Submit your job:

  ```
  condor_submit submit.sub
  ```

- Monitor job status:

  ```
  condor_q
  ```

- View detailed job analysis:

  ```
  condor_q -better-analyze <job_id>
  ```

- Check output and logs (paths specified in the `.sub` file).
- Terminate a job:

  ```
  condor_rm <job_id>
  ```
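The executable in the first step can be as simple as a wrapper script; a hypothetical minimal example (the commented-out `train.py` call is a placeholder for the real workload):

```shell
#!/bin/bash
# Minimal HTCondor job wrapper (hypothetical example).
set -euo pipefail

# Record where the job landed; useful when debugging placement issues.
msg="Job running on $(hostname)"
echo "$msg"

# The real workload would go here, e.g. (placeholder):
# python3 train.py --config config.yaml
```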
## Common HTCondor Commands

### Check Nodes
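A representative set of pool-inspection commands (standard `condor_status` options; the exact set used here may differ):

```
condor_status                    # summary of every slot in the pool
condor_status -compact           # one line per machine instead of per slot
condor_status -avail             # only slots currently able to run jobs
condor_status -long <machine>    # full ClassAd for a single machine
```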
### Check Jobs
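Typical queue-inspection commands (standard `condor_q` options, shown as a sketch):

```
condor_q             # your jobs, grouped by batch
condor_q -nobatch    # one line per job
condor_q -allusers   # jobs from all users in the queue
```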
### Debugging Jobs
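The `-better-analyze` option explains why a job is idle or failing to match machines:

```
condor_q -better-analyze <job_id>
```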
## Job Submission Notes

- To prevent overloading the cluster, set the following properties in your submit file:
    - `max_idle`: Maximum number of idle jobs in the queue.
    - `max_materialize`: Maximum number of jobs materialized in the queue at once.
- To ensure compatibility with shared filesystems:

  ```
  Requirements = TARGET.UidDomain == "mskcc.org" && \
                 TARGET.FileSystemDomain == "mskcc.org"
  ```

- To disable specific nodes:

  ```
  Requirements = (Machine != "pllimsksparky2.mskcc.org")
  ```
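The throttling knobs go alongside the `queue` statement in the submit file; a hypothetical sweep of 1000 jobs, limited to 100 materialized and 20 idle at a time:

```
# Placeholder limits - tune to your workload.
max_materialize = 100
max_idle        = 20

queue 1000
```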
## Troubleshooting

- Exec format error: Ensure the script starts with the correct shebang (e.g., `#!/bin/bash` for Bash scripts).
- MemoryError: Increase the memory allocation in the submit file:

  ```
  request_memory = 16GB
  ```

- Job stuck in hold: Inspect the job's ClassAd to find the issue:

  ```
  condor_q -l <job_id>
  ```

  Check the `HoldReason` key for details.
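When several jobs are held at once, `condor_q -hold` lists every held job together with its hold reason in one step:

```
condor_q -hold
```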
## Useful Guides
Caution: Always test your submit file and program with a small number of jobs before scaling up, to avoid wasting resources and lowering your submission priority. Use `condor_submit -dry-run` for debugging.
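For example, a dry run that prints the job ClassAds that would be created without actually queueing anything (`submit.sub` is a placeholder name; the `-` argument sends the dry-run output to stdout):

```
condor_submit -dry-run - submit.sub
```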