CEEMS: A Resource Manager Agnostic Energy and Emissions Monitoring Stack
Mahendra Paipuri (IDRIS)

TL;DR
CEEMS is an extensible, resource manager agnostic monitoring stack that provides real-time energy and emissions data for HPC and cloud workloads, supporting CPUs and GPUs, to promote energy-aware computing.
Contribution
The paper introduces CEEMS, an extensible energy and emissions monitoring system that works across different resource managers and hardware and integrates with open-source observability tools.
Findings
Capable of monitoring over 1400 nodes on the Jean-Zay supercomputer
Supports real-time energy and emissions reporting for workloads on CPUs and GPUs
Successfully deployed in an environment with a high job churn rate
Abstract
With the rapid acceleration of ML/AI research in the last couple of years, the energy consumption of the Information and Communication Technology (ICT) domain has increased sharply. As a major part of this energy consumption is due to users' workloads, users need to be aware of the energy footprint of their applications. The Compute Energy and Emissions Monitoring Stack (CEEMS) has been designed to address this issue. CEEMS can report the energy consumption and equivalent emissions of user workloads in real time for HPC and cloud platforms alike. Besides CPU energy usage, it supports reporting the energy usage of workloads on NVIDIA and AMD GPU accelerators. CEEMS is built around prominent open-source tools in the observability ecosystem, such as Prometheus and Grafana. CEEMS has been designed to be extensible, and it allows Data Center (DC) operators to easily define…
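As a rough illustration of what reporting "energy consumption and equivalent emissions" involves arithmetically, the sketch below integrates evenly spaced power samples into energy and multiplies by an emission factor. It is not CEEMS code: the function names, the 15-second sampling interval, the power readings, and the 50 gCO2e/kWh factor are all hypothetical values chosen for the example.

```python
JOULES_PER_KWH = 3.6e6  # 1 kWh = 3.6 million joules


def energy_kwh(power_samples_w, interval_s):
    """Approximate energy (kWh) from evenly spaced power samples in watts.

    Each sample is assumed to hold for one full sampling interval.
    """
    joules = sum(power_samples_w) * interval_s
    return joules / JOULES_PER_KWH


def emissions_g(energy_kwh_value, factor_g_per_kwh):
    """Equivalent emissions (gCO2e) for a given energy and emission factor."""
    return energy_kwh_value * factor_g_per_kwh


# Hypothetical power readings for one job, sampled every 15 seconds.
samples_w = [200.0, 250.0, 300.0, 250.0]
job_energy = energy_kwh(samples_w, interval_s=15.0)

# Hypothetical grid emission factor in gCO2e per kWh.
job_emissions = emissions_g(job_energy, factor_g_per_kwh=50.0)

print(f"energy: {job_energy:.6f} kWh, emissions: {job_emissions:.4f} gCO2e")
```

In a real deployment the emission factor varies with location and time, so a single constant as used here is only a simplification.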
Taxonomy
Methods: Attentive Walk-Aggregating Graph Neural Network
