Scalable HPC Job Scheduling and Resource Management in SST
Abubeker Abdurahman, Abrar Hossain, Kevin A Brown, Kazutomo Yoshii, Kishwar Ahmed

TL;DR
This paper presents a scalable job scheduling and resource management simulator within SST, enhancing HPC system simulation accuracy and performance for scientific and data-intensive workflows.
Contribution
It introduces a new scalable simulation component with advanced scheduling algorithms and workflow management, validated for accuracy and parallel performance.
Findings
Achieves accurate simulation of job wait times and node usage.
Demonstrates good parallel scalability of the simulator.
Validates effectiveness of scheduling algorithms in HPC workflows.
Abstract
Efficient job scheduling and resource management help maximize system throughput and efficiency in high-performance computing (HPC) systems. In this paper, we introduce a scalable job scheduling and resource management component within the Structural Simulation Toolkit (SST), a cycle-accurate, parallel discrete-event simulator. Our proposed simulator includes state-of-the-art job scheduling algorithms and resource management techniques. It also introduces workflow management components that support the simulation of task dependencies and resource allocations, which are crucial for the workflows typical of scientific computing and data-intensive applications. We present validation and scalability results for our job scheduling simulator. Simulation shows that our simulator achieves good accuracy in various metrics (e.g., job wait times, node usage) and also…
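To make the metrics named in the abstract concrete, the following is a minimal sketch of how job wait time and node utilization can be computed in a toy event-driven scheduler. It assumes a strict first-come, first-served (FCFS) policy with no backfilling; it does not use the SST API and is not the authors' implementation. The names `Job`, `simulate_fcfs`, and `total_nodes` are hypothetical and chosen only for illustration.

```python
import heapq
from dataclasses import dataclass


@dataclass
class Job:
    job_id: int
    submit: float   # time the job enters the queue
    nodes: int      # number of nodes requested
    runtime: float  # execution time once started


def simulate_fcfs(jobs, total_nodes):
    """Toy strict-FCFS simulation (an assumption, not the paper's algorithm).

    Returns (waits, utilization): waits maps job_id to queue wait time,
    utilization is busy node-time divided by total_nodes * makespan.
    """
    pending = sorted(jobs, key=lambda j: (j.submit, j.job_id))
    for job in pending:
        if job.nodes > total_nodes:
            raise ValueError(f"job {job.job_id} requests more nodes than exist")

    running = []            # min-heap of (end_time, nodes) for running jobs
    free = total_nodes
    now = 0.0
    makespan = 0.0
    busy_node_time = 0.0
    waits = {}

    for job in pending:
        # The clock never moves backwards: later jobs cannot start
        # before the job ahead of them in the queue has started.
        now = max(now, job.submit)

        # Release nodes of jobs that have already finished.
        while running and running[0][0] <= now:
            _, released = heapq.heappop(running)
            free += released

        # Advance to successive completion events until the job fits.
        while free < job.nodes:
            end, released = heapq.heappop(running)
            now = max(now, end)
            free += released

        waits[job.job_id] = now - job.submit
        free -= job.nodes
        end_time = now + job.runtime
        heapq.heappush(running, (end_time, job.nodes))
        busy_node_time += job.nodes * job.runtime
        makespan = max(makespan, end_time)

    utilization = busy_node_time / (total_nodes * makespan) if makespan > 0 else 0.0
    return waits, utilization


if __name__ == "__main__":
    demo = [Job(0, 0.0, 4, 10.0), Job(1, 1.0, 2, 5.0), Job(2, 2.0, 2, 5.0)]
    waits, utilization = simulate_fcfs(demo, total_nodes=4)
    print(waits, utilization)
```

A validation study of the kind the abstract describes would compare such simulated wait times and node usage against values recorded in a real system trace; the sketch only shows how the metrics themselves are defined.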
