SPARS: A Reinforcement Learning-Enabled Simulator for Power Management in HPC Job Scheduling
Muhammad Alfian Amrizal, Raka Satya Prasasta, Santana Yuda Pradata, Kadek Gemilang Santiyuda, Reza Pulungan, Hiroyuki Takizawa

TL;DR
SPARS is a flexible, reinforcement learning-enabled simulation platform for optimizing power management in HPC job scheduling, balancing energy efficiency with performance.
Contribution
The paper introduces a lightweight, modular simulator that integrates RL-based power management with traditional scheduling policies for HPC clusters.
Findings
Supports traditional scheduling policies (First Come First Served, EASY Backfilling) and RL-enhanced variants that decide when nodes are powered on or off.
Provides detailed metrics and visualizations for analysis.
Facilitates reproducible experiments and easy extension.
Abstract
High-performance computing (HPC) clusters consume enormous amounts of energy, with idle nodes as a major source of waste. Powering down unused nodes can mitigate this problem, but poorly timed transitions introduce long delays and reduce overall performance. To address this trade-off, we present SPARS, a reinforcement learning-enabled simulator for power management in HPC job scheduling. SPARS integrates job scheduling and node power state management within a discrete-event simulation framework. It supports traditional scheduling policies such as First Come First Served and EASY Backfilling, along with enhanced variants that employ reinforcement learning agents to dynamically decide when nodes should be powered on or off. Users can configure workloads and platforms in JSON format, specifying job arrivals, execution times, node power models, and transition delays. The simulator records…
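The trade-off the abstract describes (idle power versus wake-up delay) can be illustrated with a small Python sketch. This is not SPARS code: the JSON field names, the power figures, and the helpers `node_gap` and `simulate` are all hypothetical, and a fixed idle-timeout rule stands in for the power-off decision that an RL agent would make. To stay short, it assumes single-node jobs and dispatches them in FCFS order by arrival time instead of running a full discrete-event queue.

```python
import json
import math

# Hypothetical configuration; field names are illustrative, not SPARS's schema.
CONFIG = json.loads("""
{
  "platform": {
    "nodes": 2,
    "switch_on_s": 30,
    "switch_off_s": 20,
    "power_watts": {"active": 200, "idle": 100, "switching": 150, "off": 10}
  },
  "workload": [
    {"arrival_s": 0, "runtime_s": 100},
    {"arrival_s": 0, "runtime_s": 100},
    {"arrival_s": 500, "runtime_s": 100}
  ]
}
""")


def node_gap(plat, timeout, free_at, t, wake):
    """Return (ready_time, energy_joules) for a node that is free from free_at.

    wake=True: a job requests the node at time t.
    wake=False: only account for the node's energy up to horizon t.
    """
    p = plat["power_watts"]
    if t <= free_at:                      # node still busy, no idle gap
        return free_at, 0.0
    if t - free_at <= timeout:            # node stayed on and idle
        return t, (t - free_at) * p["idle"]
    off_start = free_at + timeout         # idle timeout expired here
    off_done = off_start + plat["switch_off_s"]
    energy = timeout * p["idle"]
    if not wake:
        energy += (min(t, off_done) - off_start) * p["switching"]
        energy += max(0.0, t - off_done) * p["off"]
        return t, energy
    wake_start = max(t, off_done)         # shutdown must finish before waking
    energy += plat["switch_off_s"] * p["switching"]
    energy += (wake_start - off_done) * p["off"]
    energy += plat["switch_on_s"] * p["switching"]
    return wake_start + plat["switch_on_s"], energy


def simulate(cfg, idle_timeout=None):
    """FCFS dispatch of single-node jobs; idle_timeout=None keeps nodes always on."""
    timeout = math.inf if idle_timeout is None else idle_timeout
    plat = cfg["platform"]
    free_at = [0.0] * plat["nodes"]       # all nodes start on and idle
    energy, waits = 0.0, []
    for job in sorted(cfg["workload"], key=lambda j: j["arrival_s"]):
        t = job["arrival_s"]
        options = [node_gap(plat, timeout, f, t, wake=True) for f in free_at]
        n = min(range(len(options)), key=lambda i: options[i][0])
        start, gap_energy = options[n]    # node that can start the job earliest
        energy += gap_energy + job["runtime_s"] * plat["power_watts"]["active"]
        free_at[n] = start + job["runtime_s"]
        waits.append(start - t)
    horizon = max(free_at)                # makespan; account for trailing idle time
    for f in free_at:
        energy += node_gap(plat, timeout, f, horizon, wake=False)[1]
    return {
        "energy_joules": energy,
        "mean_wait_s": sum(waits) / len(waits),
        "makespan_s": horizon,
    }


if __name__ == "__main__":
    for timeout in (None, 60):
        print(timeout, simulate(CONFIG, idle_timeout=timeout))
```

With these made-up numbers, the 60-second timeout run consumes less energy than the always-on run, but the third job has to wait for a node to switch back on. That is the kind of decision the abstract assigns to the RL agent.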
Taxonomy
Topics: Cloud Computing and Resource Management · Distributed and Parallel Computing Systems · Parallel Computing and Optimization Techniques
