CEEMS: A Resource Manager Agnostic Energy and Emissions Monitoring Stack

Mahendra Paipuri (IDRIS)

arXiv:2412.07290·eess.SY·December 11, 2024

TL;DR

CEEMS is an extensible, resource manager agnostic monitoring stack that provides real-time energy and emissions data for HPC and cloud workloads, supporting CPUs and GPUs, to promote energy-aware computing.

Contribution

The paper introduces CEEMS, a novel, extensible energy and emissions monitoring system that is compatible with various resource managers and hardware and is integrated with open-source observability tools.

Findings

01. Capable of monitoring over 1400 nodes on the Jean-Zay supercomputer

02. Supports real-time energy and emissions reporting for CPUs and GPUs

03. Successfully deployed under a high job churn rate

Abstract

With the rapid acceleration of ML/AI research in the last couple of years, the energy consumption of the Information and Communication Technology (ICT) domain has rapidly increased. As a major part of this energy consumption is due to users' workloads, it is evident that users need to be aware of the energy footprint of their applications. Compute Energy and Emissions Monitoring Stack (CEEMS) has been designed to address this issue. CEEMS can report energy consumption and equivalent emissions of user workloads in real time for HPC and cloud platforms alike. Besides CPU energy usage, it supports reporting energy usage of workloads on NVIDIA and AMD GPU accelerators. CEEMS has been built around prominent open-source tools in the observability ecosystem, such as Prometheus and Grafana. CEEMS has been designed to be extensible, and it allows Data Center (DC) operators to easily define…
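
To make "equivalent emissions" concrete, here is a minimal sketch of the usual arithmetic: energy is average power multiplied by time, and emissions are energy multiplied by a grid emission factor. This is not taken from CEEMS itself. The function names, the 300 W power draw, the 2-hour duration, and the 50 gCO2e/kWh factor are all hypothetical values chosen for illustration.

```python
# Illustrative only: not CEEMS code. All names and numbers are assumptions.

JOULES_PER_KWH = 3_600_000.0  # 1 kWh = 3.6e6 J (1 W sustained for 1 s = 1 J)


def energy_kwh(avg_power_watts: float, duration_seconds: float) -> float:
    """Energy in kWh for a workload drawing a given average power."""
    return avg_power_watts * duration_seconds / JOULES_PER_KWH


def emissions_g_co2e(energy_kwh_value: float, factor_g_per_kwh: float) -> float:
    """Equivalent emissions in grams of CO2e for a given emission factor."""
    return energy_kwh_value * factor_g_per_kwh


# Hypothetical job: 300 W average draw for 2 hours on a grid at 50 gCO2e/kWh.
job_energy = energy_kwh(avg_power_watts=300.0, duration_seconds=7200.0)
job_emissions = emissions_g_co2e(job_energy, factor_g_per_kwh=50.0)

print(f"energy: {job_energy:.3f} kWh, emissions: {job_emissions:.1f} gCO2e")
```

In practice the emission factor varies by region and over time, so a real-time report would need a time-varying factor rather than the single constant assumed here.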
