A Real-Time Digital Twin for Adaptive Scheduling
Yihe Zhang, Yash Kurkure, Yiheng Tao, Michael E. Papka, Zhiling Lan

TL;DR
This paper introduces SchedTwin, a real-time digital twin that adaptively guides HPC cluster scheduling through predictive simulation. In preliminary results it outperforms static policies at low overhead.
Contribution
The paper presents SchedTwin, a real-time digital twin system that adapts HPC scheduling decisions through predictive simulation and integrates with an existing production scheduler (PBS).
Findings
In preliminary results, SchedTwin consistently outperforms widely used static scheduling policies.
It maintains low overhead: a few seconds per scheduling cycle.
Demonstrates practical effectiveness for adaptive HPC scheduling.
Abstract
High-performance computing (HPC) workloads are becoming increasingly diverse, exhibiting wide variability in job characteristics, yet cluster scheduling has long relied on static, heuristic-based policies. In this work we present SchedTwin, a real-time digital twin designed to adaptively guide scheduling decisions using predictive simulation. SchedTwin periodically ingests runtime events from the physical scheduler, performs rapid what-if evaluations of multiple policies using a high-fidelity discrete-event simulator, and dynamically selects the policy that best satisfies the administrator-configured optimization goal. We implement SchedTwin as open-source software and integrate it with the production PBS scheduler. Preliminary results show that SchedTwin consistently outperforms widely used static scheduling policies, while maintaining low overhead (a few seconds per scheduling cycle). These…
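The what-if cycle described in the abstract (snapshot the cluster state, simulate each candidate policy, pick the one that best meets the goal) can be illustrated with a minimal Python sketch. This is not SchedTwin's simulator or its PBS integration. The names `Job`, `POLICIES`, `simulate`, and `select_policy` are hypothetical, the three candidate policies (FCFS, SJF, LJF) and the average-wait-time goal are assumptions chosen for illustration, and the toy simulator omits backfilling and assumes every job fits on the cluster.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    job_id: str
    submit: float   # submission time
    nodes: int      # nodes requested
    runtime: float  # estimated runtime


# Hypothetical candidate policies: each maps a job to a sort key
# (a lower key means the job is started earlier).
POLICIES = {
    "FCFS": lambda j: j.submit,    # first come, first served
    "SJF": lambda j: j.runtime,    # shortest job first
    "LJF": lambda j: -j.runtime,   # longest job first
}


def simulate(queue, running, free_nodes, now, key):
    """Toy discrete-event what-if run; returns the average wait time.

    `running` is a list of (end_time, nodes) pairs for jobs already
    executing. Jobs start in strict priority order with no backfilling.
    Assumes every job requests no more nodes than the cluster has.
    """
    events = list(running)
    heapq.heapify(events)  # min-heap ordered by job end time
    clock = now
    total_wait = 0.0
    pending = sorted(queue, key=key)
    for job in pending:
        # Advance simulated time until enough nodes have been released.
        while free_nodes < job.nodes:
            end_time, released = heapq.heappop(events)
            clock = max(clock, end_time)
            free_nodes += released
        total_wait += clock - job.submit
        free_nodes -= job.nodes
        heapq.heappush(events, (clock + job.runtime, job.nodes))
    return total_wait / len(pending) if pending else 0.0


def select_policy(queue, running, free_nodes, now):
    """Evaluate every candidate policy and return the best one.

    The goal assumed here is minimum average wait time; a different
    administrator-configured goal would swap in a different metric.
    """
    scores = {
        name: simulate(queue, running, free_nodes, now, key)
        for name, key in POLICIES.items()
    }
    best = min(scores, key=scores.get)
    return best, scores
```

In a periodic loop, `select_policy` would be called once per scheduling cycle with a fresh snapshot of the queue and running jobs, and the chosen policy name would then be handed to the physical scheduler. How that handoff works in SchedTwin is not described in the abstract.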
Taxonomy
Topics: Cloud Computing and Resource Management · Distributed and Parallel Computing Systems · IoT and Edge/Fog Computing
