An Autonomy Loop for Dynamic HPC Job Time Limit Adjustment
Thomas Jakobsche, Osman Seckin Simsek, Jim Brandt, Ann Gentile, Florina M. Ciorba

TL;DR
This paper introduces a feedback-driven autonomy loop that dynamically adjusts HPC job time limits based on checkpoint progress, reducing tail waste by a reported 95% and improving scheduling efficiency.
Contribution
The paper presents an autonomy loop for HPC job scheduling that adapts job time limits at runtime, with the goal of reducing tail waste and improving resource utilization.
Findings
95% reduction in tail waste
Approximately 1.3% of total CPU time saved
Improved scheduling metrics such as job wait time
Abstract
High Performance Computing (HPC) systems rely on fixed, user-provided estimates of job time limits. These estimates are often inaccurate, resulting in inefficient resource use and the loss of unsaved work if a job times out shortly before reaching its next checkpoint. This work proposes a novel feedback-driven autonomy loop that dynamically adjusts HPC job time limits based on checkpoint progress reported by applications. Our approach monitors checkpoint intervals and queued jobs, enabling informed decisions to either cancel a job early, right after its last completed checkpoint, or extend the time limit enough to accommodate the next checkpoint. The objective is to minimize tail waste, that is, the computation that occurs between the last checkpoint and the termination of a job, which is not saved and hence wasted. Through experiments conducted on a subset of a production workload trace,…
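The cancel-or-extend decision described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `JobState`, `predict_next_checkpoint`, `decide`, and `tail_waste`, the mean-interval predictor, the `max_extension` cap, and the rule "extend only when no jobs are queued" are all assumptions made for illustration. The abstract says only that checkpoint intervals and queued jobs are monitored, not how they are combined.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Optional, Tuple


@dataclass
class JobState:
    # All times are seconds since job start (hypothetical representation).
    time_limit: float
    checkpoints: List[float]  # completion times of reported checkpoints


def tail_waste(end_time: float, checkpoints: List[float]) -> float:
    """Computation between the last checkpoint and job termination."""
    last = checkpoints[-1] if checkpoints else 0.0
    return max(0.0, end_time - last)


def predict_next_checkpoint(job: JobState) -> Optional[float]:
    """Assumed predictor: last checkpoint plus the mean observed interval."""
    if len(job.checkpoints) < 2:
        return None  # not enough history to estimate an interval
    intervals = [b - a for a, b in zip(job.checkpoints, job.checkpoints[1:])]
    return job.checkpoints[-1] + mean(intervals)


def decide(job: JobState, max_extension: float,
           jobs_queued: bool) -> Tuple[str, float]:
    """Called right after a checkpoint completes; returns (action, new_limit)."""
    nxt = predict_next_checkpoint(job)
    if nxt is None or nxt <= job.time_limit:
        return ("keep", job.time_limit)  # next checkpoint fits in the limit
    needed = nxt - job.time_limit
    # Assumed policy: extend only if nobody is waiting and the cap allows it.
    if not jobs_queued and needed <= max_extension:
        return ("extend", nxt)
    # Otherwise stop now, so nothing after the last checkpoint is lost.
    return ("cancel", job.checkpoints[-1])
```

For example, a job with a 2000 s limit that checkpoints every 600 s would lose 200 s of work if it ran to its limit after the checkpoint at 1800 s. Under the assumed policy it is either extended to 2400 s or cancelled at 1800 s, depending on whether other jobs are waiting.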
