EcoShift: Performance-Aware Power Management for Power-Constrained Heterogeneous Systems
Zhong Zheng, Michael E. Papka, Zhiling Lan

TL;DR
EcoShift is a performance-aware, cluster-wide power management framework for heterogeneous CPU-GPU HPC systems that distributes reclaimed power across applications to maximize average performance improvement under a strict cluster-wide power limit.
Contribution
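EcoShift combines online performance prediction with a dynamic-programming-based allocator to distribute reclaimed power across CPU-GPU applications, addressing a gap in fair-share and utilization-based policies, which do not capture each application's sensitivity to CPU and GPU power caps.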
Findings
EcoShift achieves up to 6% average performance improvement.
EcoShift consistently outperforms state-of-the-art policies in emulation-based evaluations on two heterogeneous Intel CPU and NVIDIA A100/H100 GPU platforms.
EcoShift effectively manages power distribution for CPU-GPU workloads.
Abstract
Power-constrained HPC systems increasingly run heterogeneous CPU-GPU applications under strict cluster-wide power limits. Existing cluster-wide power management policies rely on fair-share or utilization heuristics and do not capture application-specific sensitivity to CPU and GPU power caps, leading to inefficient use of reclaimed power. We present EcoShift, a performance-aware cluster-wide power management framework. EcoShift combines online performance prediction with a dynamic-programming-based allocator to distribute reclaimed power across CPU-GPU applications for maximum average performance improvement. Through emulation-based evaluation on two heterogeneous Intel CPU and NVIDIA A100/H100 GPU platforms with diverse CPU-GPU workloads, EcoShift consistently outperforms state-of-the-art policies, achieving up to 6% average performance improvement while preserving the…
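To make the allocation idea in the abstract concrete, the following is a minimal sketch of a dynamic-programming allocator, not the authors' implementation. It assumes that reclaimed power is discretized into whole watts, that a predictor has already produced, for each application, a small set of candidate (extra watts, predicted gain) options, and that leaving an application unchanged is always allowed. Under those assumptions the problem is a multiple-choice knapsack. The function name `allocate_reclaimed_power` and the sample numbers are hypothetical.

```python
from typing import List, Tuple

Option = Tuple[int, float]  # (extra watts granted, predicted performance gain)


def allocate_reclaimed_power(
    options: List[List[Option]], budget: int
) -> Tuple[List[int], float]:
    """Pick one option per application so that total predicted gain is
    maximized while total extra watts stay within `budget`.

    Assumption (not from the paper): each application may also receive
    nothing, so a (0, 0.0) option is added to every application.
    """
    augmented = [[(0, 0.0)] + list(opts) for opts in options]

    # best[b] = best total gain over the applications processed so far
    # when at most b watts may be spent.
    best = [0.0] * (budget + 1)
    picks_per_app: List[List[int]] = []

    for opts in augmented:
        new_best = [float("-inf")] * (budget + 1)
        picks = [0] * (budget + 1)
        for b in range(budget + 1):
            for k, (watts, gain) in enumerate(opts):
                if watts <= b and best[b - watts] + gain > new_best[b]:
                    new_best[b] = best[b - watts] + gain
                    picks[b] = k
        best = new_best
        picks_per_app.append(picks)

    # Walk backwards through the recorded picks to recover the allocation.
    allocation = [0] * len(augmented)
    remaining = budget
    for i in range(len(augmented) - 1, -1, -1):
        watts, _ = augmented[i][picks_per_app[i][remaining]]
        allocation[i] = watts
        remaining -= watts

    return allocation, best[budget]


# Hypothetical predictions for two applications and 30 W of reclaimed power.
example_options = [
    [(10, 0.02), (20, 0.05)],   # application 0
    [(10, 0.04), (20, 0.045)],  # application 1
]
example_allocation, example_gain = allocate_reclaimed_power(example_options, 30)
```

Because the number of applications is fixed, maximizing the total predicted gain is equivalent to maximizing the average. The running time of this sketch grows with the number of applications, the budget in watts, and the number of options per application, so a coarser power granularity makes it cheaper.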
