A Reinforcement Learning Approach for Performance-aware Reduction in Power Consumption of Data Center Compute Nodes

Akhilesh Raj; Swann Perarnau; Aniruddha Gokhale

arXiv:2308.08069 · cs.DC · August 17, 2023 · 1 cites

TL;DR

This paper presents a reinforcement learning-based method to dynamically manage power consumption in data center compute nodes, aiming to reduce energy use without degrading application performance.

Contribution

It introduces a novel RL-based power capping policy that balances energy efficiency and performance using real-time system observations and hardware controls.

Findings

01 RL agent effectively reduces power consumption
02 Maintains application performance during power capping
03 Demonstrates practical implementation on real hardware

Abstract

As Exascale computing becomes a reality, the energy needs of compute nodes in cloud data centers will continue to grow. A common approach to reducing this energy demand is to limit the power consumption of hardware components when workloads are experiencing bottlenecks elsewhere in the system. However, designing a resource controller capable of detecting and limiting power consumption on-the-fly is a complex issue and can also adversely impact application performance. In this paper, we explore the use of Reinforcement Learning (RL) to design a power capping policy on cloud compute nodes using observations on current power consumption and instantaneous application performance (heartbeats). By leveraging the Argo Node Resource Management (NRM) software stack in conjunction with the Intel Running Average Power Limit (RAPL) hardware control mechanism, we design an agent to control the…
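The control idea in the abstract (choose a power cap from observed power and heartbeat-based progress) can be illustrated with a toy example. The sketch below is not the authors' agent and does not call Argo NRM or Intel RAPL; it is a stateless epsilon-greedy learner running against a simulated node. The cap values, the saturation "knee", the peak heartbeat rate, and the reward weighting are all assumptions chosen for illustration.

```python
import random

# Hypothetical discrete power caps in watts; not taken from the paper.
POWER_CAPS = [60, 80, 100, 120, 140]
KNEE_WATTS = 100      # assumed cap above which the toy workload stops speeding up
MAX_HEARTBEAT = 50.0  # assumed peak heartbeats per second for the toy workload


def simulated_heartbeat_rate(cap_watts, rng):
    """Toy node model: progress grows with the cap, then saturates at the knee."""
    base = MAX_HEARTBEAT * min(cap_watts, KNEE_WATTS) / KNEE_WATTS
    return max(0.0, base + rng.gauss(0.0, 0.5))


def reward(heartbeat_rate, cap_watts, power_weight=0.5):
    """Trade normalized progress against normalized power (weight is an assumption)."""
    progress = heartbeat_rate / MAX_HEARTBEAT
    power = cap_watts / max(POWER_CAPS)
    return progress - power_weight * power


def train(episodes=5000, epsilon=0.1, learning_rate=0.05, seed=0):
    """Epsilon-greedy value learning over the cap choices (a stateless bandit)."""
    rng = random.Random(seed)
    values = [0.0] * len(POWER_CAPS)
    for _ in range(episodes):
        if rng.random() < epsilon:
            # Explore: try a random cap.
            action = rng.randrange(len(POWER_CAPS))
        else:
            # Exploit: use the cap with the highest estimated reward so far.
            action = max(range(len(POWER_CAPS)), key=values.__getitem__)
        cap = POWER_CAPS[action]
        r = reward(simulated_heartbeat_rate(cap, rng), cap)
        # Move the estimate for this cap toward the observed reward.
        values[action] += learning_rate * (r - values[action])
    return values


values = train()
best_cap = POWER_CAPS[max(range(len(POWER_CAPS)), key=values.__getitem__)]
print("learned cap:", best_cap)
```

In this toy model, caps above the knee cost more power without adding progress, so the learner should settle on the 100 W cap. A real controller would replace `simulated_heartbeat_rate` with measured heartbeats and apply the chosen cap through the hardware control interface.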

Code & Models

Repositories

akhileshraj91/generalized_rl_anl
none · Official

Taxonomy

Topics: Cloud Computing and Resource Management · IoT and Edge/Fog Computing · Blockchain Technology Applications and Security