Optimization of Topology-Aware Job Allocation on a High-Performance Computing Cluster by Neural Simulated Annealing
Zekang Lan, Yan Xu, Yingkun Huang, Dian Huang, Shengzhong Feng

TL;DR
This paper introduces a neural simulated annealing approach for topology-aware job allocation on HPC clusters, aiming to reduce network interference and improve communication efficiency.
Contribution
It proposes a novel neural simulated annealing algorithm for dynamic job allocation, extending traditional simulated annealing with learned repair operators.
Findings
Neural simulated annealing (NSA) outperforms standard simulated annealing (SA) and the SCIP solver in the reported experiments.
Both models effectively reduce inter-job network interference.
The approach reduces communication hop cost in HPC job scheduling.
Abstract
Jobs on high-performance computing (HPC) clusters can suffer significant performance degradation due to inter-job network interference. The topology-aware job allocation problem (TJAP) is the problem of deciding how to dedicate nodes to specific applications so as to mitigate this interference. In this paper, we study the window-based TJAP on a fat-tree network, aiming to minimize the communication hop cost, a defined inter-job interference metric. The window-based scheduling approach periodically takes the jobs in the queue and solves an assignment problem that maps jobs to the available nodes. Two special allocation strategies are considered: the static continuity assignment strategy (SCAS) and the dynamic continuity assignment strategy (DCAS). For the SCAS, a 0-1 integer programming model is developed. For the DCAS, an approach called neural simulated algorithm…
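To make the idea of simulated annealing with a repair step concrete, here is a minimal sketch under stated assumptions. It is not the paper's NSA: the repair operator below is a hand-written greedy heuristic standing in for a learned one, the topology is a toy two-level tree (4 leaf switches with 4 nodes each), and the hop metric (2 hops under the same leaf switch, 4 hops otherwise, summed over node pairs of each job) is an assumed illustration rather than the paper's definition. All names (`hops`, `destroy`, `repair`, `anneal`) are hypothetical.

```python
import math
import random

NODES_PER_LEAF = 4  # assumed toy topology: 4 leaf switches x 4 nodes


def hops(a, b):
    """Assumed hop metric: 0 for the same node, 2 under one leaf switch, 4 otherwise."""
    if a == b:
        return 0
    return 2 if a // NODES_PER_LEAF == b // NODES_PER_LEAF else 4


def job_cost(nodes):
    """Sum of pairwise hop counts between the nodes of one job."""
    return sum(hops(a, b) for i, a in enumerate(nodes) for b in nodes[i + 1:])


def total_cost(alloc):
    """Total hop cost over all jobs; alloc maps job name -> list of node ids."""
    return sum(job_cost(nodes) for nodes in alloc.values())


def destroy(alloc, rng):
    """Copy the allocation and release all nodes of one randomly chosen job."""
    new = {job: list(nodes) for job, nodes in alloc.items()}
    job = rng.choice(sorted(new))
    size = len(new[job])
    new[job] = []
    return new, job, size


def repair(alloc, job, size, n_nodes, rng):
    """Greedy stand-in for a learned repair operator (an assumption of this sketch).

    Starts from a random free node, then repeatedly adds the free node
    closest (in hops) to the nodes already chosen for the job.
    """
    used = {n for nodes in alloc.values() for n in nodes}
    free = [n for n in range(n_nodes) if n not in used]
    chosen = [rng.choice(free)]
    free.remove(chosen[0])
    while len(chosen) < size:
        nxt = min(free, key=lambda n: sum(hops(n, c) for c in chosen))
        chosen.append(nxt)
        free.remove(nxt)
    alloc[job] = chosen
    return alloc


def anneal(alloc, n_nodes, steps=2000, t0=5.0, alpha=0.995, seed=0):
    """Simulated annealing whose neighbor move is destroy followed by repair."""
    rng = random.Random(seed)
    cur, cur_cost = alloc, total_cost(alloc)
    best, best_cost = cur, cur_cost
    t = t0
    for _ in range(steps):
        cand, job, size = destroy(cur, rng)
        cand = repair(cand, job, size, n_nodes, rng)
        delta = total_cost(cand) - cur_cost
        # Metropolis rule: always accept improvements, sometimes accept worse moves.
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            cur, cur_cost = cand, cur_cost + delta
            if cur_cost < best_cost:
                best, best_cost = cur, cur_cost
        t *= alpha  # geometric cooling schedule
    return best, best_cost


# Three jobs scattered across leaf switches as a deliberately poor starting point.
initial = {"A": [0, 4, 8, 12], "B": [1, 5, 9, 13], "C": [2, 6]}
best, best_cost = anneal(initial, n_nodes=16)
print("initial cost:", total_cost(initial), "best cost found:", best_cost)
```

In this toy setting the best achievable cost is 26 (each job packed under a single leaf switch), which gives a reference point for judging the search. Swapping the greedy `repair` for a trained model is where a learned operator would plug in, under the assumptions above.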
Taxonomy
Topics: Distributed and Parallel Computing Systems · Cloud Computing and Resource Management · Interconnection Networks and Systems
Methods: Repair
