Operational Strategies for Non-Disruptive Scheduling Transitions in Production HPC Systems

Glen MacLachlan; Joseph Creech; Rubeel Muhammad Iqbal; Clark Gaylord; Jake Messick

arXiv:2603.27863 · cs.DC · March 31, 2026

TL;DR

This paper presents a case study of moving a production academic HPC cluster from node-exclusive to resource-aware (consumable resource) scheduling without disrupting active workflows. The transition relied on a time-bounded compatibility layer, observability-driven feedback, and targeted user engagement.

Contribution

The paper introduces an operational strategy for non-disruptive scheduling transitions in production HPC systems, integrating observability, user engagement, and targeted operational design.

Findings

01

Median queue wait times for CPU workloads fell from 277 minutes to under 3 minutes.

02

Median queue wait times for GPU workloads fell from 81 minutes to 3.4 minutes.

03

Users adopting TRES-based submission showed strong long-term retention.
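
For a sense of scale, the wait-time figures in findings 01 and 02 can be turned into relative reductions. The short Python sketch below does only that arithmetic; the helper name `relative_reduction` is illustrative, and because the CPU figure is reported as "under 3 minutes", the CPU result is a lower bound rather than an exact value.

```python
def relative_reduction(before_min: float, after_min: float) -> float:
    """Return the fractional drop in wait time from before_min to after_min."""
    return (before_min - after_min) / before_min


# CPU: the reported "under 3 minutes" is an upper bound on the new wait time,
# so using 3 here gives a lower bound on the reduction.
cpu_reduction_lower_bound = relative_reduction(277, 3)

# GPU: both values are reported directly (81 minutes and 3.4 minutes).
gpu_reduction = relative_reduction(81, 3.4)

print(f"CPU: at least {cpu_reduction_lower_bound:.1%}")  # CPU: at least 98.9%
print(f"GPU: about {gpu_reduction:.1%}")                 # GPU: about 95.8%
```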

Abstract

Migrating heterogeneous high-performance computing (HPC) systems to resource-aware scheduling introduces both technical and behavioral challenges, particularly in production environments with established user workflows. This paper presents a case study of transitioning a production academic HPC cluster from node-exclusive to consumable resource scheduling mid-lifecycle, without disrupting active workloads. We describe an operational strategy combining a time-bounded compatibility layer, observability-driven feedback, and targeted user engagement to guide adoption of explicit resource declaration. This approach protected active research workflows throughout the transition, avoiding the disruption that a direct cut-over would have imposed on the user community. Following deployment, median queue wait times fell from 277 minutes to under 3 minutes for CPU workloads and from 81 minutes to…
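
The abstract does not say how the time-bounded compatibility layer works, so the following Python sketch is only one plausible shape for such a layer, not the authors' implementation. It assumes that legacy jobs declare only a node count, that the layer fills in a whole-node CPU request plus a warning until a sunset date, and that undeclared requests are rejected afterwards. Every name and value here (`JobRequest`, `apply_compat_layer`, `CORES_PER_NODE`, `SUNSET`) is hypothetical.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import Optional, Tuple

# Hypothetical values for illustration only; none of them come from the paper.
CORES_PER_NODE = 40
SUNSET = date(2025, 1, 1)


@dataclass(frozen=True)
class JobRequest:
    nodes: int
    cpus: Optional[int] = None  # None marks a legacy, node-only request


def apply_compat_layer(req: JobRequest, today: date) -> Tuple[JobRequest, Optional[str]]:
    """Return the (possibly patched) request and an optional warning for the user."""
    if req.cpus is not None:
        # Explicit resource declaration: pass through untouched.
        return req, None
    if today >= SUNSET:
        # After the sunset date, the compatibility behaviour is no longer offered.
        raise ValueError("explicit CPU request required; compatibility period has ended")
    # Before the sunset date, emulate node-exclusive behaviour and nudge the user.
    patched = replace(req, cpus=req.nodes * CORES_PER_NODE)
    warning = (
        f"no CPU count declared; assuming {patched.cpus} CPUs "
        f"until {SUNSET.isoformat()}"
    )
    return patched, warning


patched, warning = apply_compat_layer(JobRequest(nodes=2), date(2024, 6, 1))
print(patched.cpus, "|", warning)
```

The warning returned alongside the patched request is where a feedback channel could hook in: under these assumptions, users keep running unchanged scripts during the compatibility window while being told what an explicit declaration would look like.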
