Towards Energy Efficient Co-Scheduling in HPC
Zhong Zheng, Michael E. Papka, Zhiling Lan

TL;DR
EcoSched is an online scheduler that optimizes GPU count selection and application coscheduling to enhance energy efficiency and performance in multi-GPU HPC systems.
Contribution
It introduces EcoSched, a novel runtime-aware scheduler that jointly optimizes GPU allocation and application placement for energy-efficient HPC workloads.
Findings
EcoSched achieves up to 14.8% energy savings.
It improves makespan by up to 30.1%.
It reduces energy-delay product (EDP) by 40.4%.
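The three figures can be cross-checked against each other. The sketch below assumes EDP means energy-delay product (energy multiplied by runtime) and that the energy and makespan figures come from the same run; neither assumption is confirmed by the summary. The function name `edp_reduction` is hypothetical.

```python
def edp_reduction(energy_saving: float, makespan_improvement: float) -> float:
    """Fractional EDP reduction, assuming EDP = energy * time.

    Both arguments are fractions (0.148 means 14.8%).
    """
    remaining_energy = 1.0 - energy_saving
    remaining_time = 1.0 - makespan_improvement
    return 1.0 - remaining_energy * remaining_time


# Plug in the reported energy and makespan figures.
print(f"{edp_reduction(0.148, 0.301):.1%}")  # prints 40.4%
```

Under these assumptions, a 14.8% energy saving combined with a 30.1% shorter makespan gives the reported 40.4% EDP reduction.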
Abstract
Modern multi-GPU HPC systems expose substantial computational capacity, yet inefficient GPU allocation often leads to wasted energy and underutilization. In practice, GPU applications exhibit heterogeneous and nonlinear scaling, making it inefficient to always use all available GPUs. We present EcoSched, an online scheduler that jointly optimizes GPU count selection and application coscheduling to improve workload-level efficiency on multi-GPU systems. EcoSched uses lightweight runtime profiling to estimate relative performance across GPU counts, applies a score-based policy to balance energy efficiency and idle resources, and incorporates NUMA-aware placement to mitigate interference. We implement EcoSched on heterogeneous CPU-GPU platforms and evaluate it with diverse workloads on H100, A100, and V100 systems. EcoSched achieves up to 14.8% energy savings, 30.1% makespan improvement,…
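To make the idea of a score-based GPU count choice concrete, here is a minimal sketch. It is not EcoSched's actual policy, which the abstract does not specify. The `Profile` fields, the scoring formula, the `idle_weight` parameter, and all numbers are assumptions chosen for illustration only.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Profile:
    gpus: int        # candidate GPU count
    rel_perf: float  # throughput relative to 1 GPU (assumed to come from a short profiling run)
    power_w: float   # assumed average power at this GPU count, including a host share


def score(p: Profile, free_gpus: int, idle_weight: float) -> float:
    """Hypothetical score: work per watt, discounted by the fraction of GPUs left idle."""
    efficiency = p.rel_perf / p.power_w
    idle_fraction = (free_gpus - p.gpus) / free_gpus
    return efficiency * (1.0 - idle_weight * idle_fraction)


def pick_gpu_count(profiles: List[Profile], free_gpus: int,
                   idle_weight: float = 0.1) -> Optional[int]:
    """Return the feasible GPU count with the highest score, or None if nothing fits."""
    feasible = [p for p in profiles if p.gpus <= free_gpus]
    if not feasible:
        return None
    return max(feasible, key=lambda p: score(p, free_gpus, idle_weight)).gpus


# Made-up profile of an application with sublinear scaling.
profiles = [
    Profile(gpus=1, rel_perf=1.0, power_w=500.0),
    Profile(gpus=2, rel_perf=1.8, power_w=800.0),
    Profile(gpus=4, rel_perf=2.6, power_w=1400.0),
]
print(pick_gpu_count(profiles, free_gpus=4))  # 2 with these made-up numbers
```

With these illustrative numbers, two GPUs score highest: four GPUs add little throughput for the extra power, while one GPU leaves most of the node idle. The two GPUs left free could then host a second application, which is the coscheduling opportunity the abstract describes.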
