Operational Strategies for Non-Disruptive Scheduling Transitions in Production HPC Systems
Glen MacLachlan, Joseph Creech, Rubeel Muhammad Iqbal, Clark Gaylord, Jake Messick

TL;DR
This paper presents a case study of transitioning a production HPC cluster from node-exclusive to resource-aware scheduling without disrupting workflows, using a time-bounded compatibility layer, observability-driven feedback, and targeted user engagement.
Contribution
The paper introduces an operational strategy for non-disruptive scheduling transitions in HPC systems that integrates observability, user engagement, and targeted operational design.
Findings
Median queue wait times fell from 277 minutes to under 3 minutes for CPU workloads.
Median queue wait times for GPU workloads fell from 81 minutes to 3.4 minutes.
Users adopting TRES-based submission showed strong long-term retention.
Abstract
Migrating heterogeneous high-performance computing (HPC) systems to resource-aware scheduling introduces both technical and behavioral challenges, particularly in production environments with established user workflows. This paper presents a case study of transitioning a production academic HPC cluster from node-exclusive to consumable resource scheduling mid-lifecycle, without disrupting active workloads. We describe an operational strategy combining a time-bounded compatibility layer, observability-driven feedback, and targeted user engagement to guide adoption of explicit resource declaration. This approach protected active research workflows throughout the transition, avoiding the disruption that a direct cut-over would have imposed on the user community. Following deployment, median queue wait times fell from 277 minutes to under 3 minutes for CPU workloads and from 81 minutes to…
