DiG: Enabling Out-of-Band Scalable High-Resolution Monitoring for Data-Center Analytics, Automation and Control (Extended)
Antonio Libri, Andrea Bartolini, Luca Benini

TL;DR
This paper introduces DiG, a scalable out-of-band monitoring platform for data centers that provides high-resolution, real-time profiling of performance and power consumption, enabling efficient analysis and control.
Contribution
The paper presents a novel out-of-band monitoring system that achieves high-resolution, real-time data collection for data centers, overcoming bandwidth and storage bottlenecks.
Findings
Supports power sampling at periods as short as 20 microseconds (50 kHz) with less than 1% error
Enables node-level and cluster-level analysis in real time
Successfully deployed in a high-performance compute cluster
Abstract
Data centers are increasing in size and complexity, and we need scalable approaches to support their automated analysis and control. Performance counters and power consumption are their key "vital signs". State-of-the-Art (SoA) monitoring systems provide built-in tools to collect performance measurements, and custom solutions to gain insight into their power consumption. However, with the increase in measurement resolution (in time and space) and the ensuing huge amount of measurement data to handle, new challenges arise, such as bottlenecks in network bandwidth and storage, and software overhead on the monitoring units. To face these challenges we propose a novel monitoring platform for data centers, which enables real-time high-resolution profiling (i.e., all available performance counters and the entire signal bandwidth of the power consumption at the plug - sampling up to 20 µs - with…
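To give a rough sense of why high-resolution sampling strains bandwidth and storage, the sketch below converts the 20 µs sampling period quoted in the abstract into a sample rate and a raw data volume. Only the 20 µs figure comes from the paper; the bytes per sample and the node count are illustrative assumptions, and the function names are hypothetical.

```python
# Back-of-the-envelope data volume for high-rate power sampling.
# Only the 20 us period is taken from the abstract; the other
# constants are assumptions chosen purely for illustration.

PERIOD_US = 20          # sampling period from the abstract, in microseconds
BYTES_PER_SAMPLE = 4    # ASSUMPTION: one 32-bit value per power sample
NODES = 100             # ASSUMPTION: hypothetical cluster size
SECONDS_PER_DAY = 86_400


def sample_rate_hz(period_us: int) -> float:
    """Samples per second for a sampling period given in microseconds."""
    return 1_000_000 / period_us


def raw_bytes_per_second(period_us: int, bytes_per_sample: int, nodes: int = 1) -> float:
    """Uncompressed data rate if every sample from every node is shipped as-is."""
    return sample_rate_hz(period_us) * bytes_per_sample * nodes


if __name__ == "__main__":
    rate = sample_rate_hz(PERIOD_US)
    per_node = raw_bytes_per_second(PERIOD_US, BYTES_PER_SAMPLE)
    cluster = raw_bytes_per_second(PERIOD_US, BYTES_PER_SAMPLE, NODES)
    print(f"sample rate:        {rate:,.0f} Hz")
    print(f"per node:           {per_node / 1e3:,.0f} kB/s")
    print(f"cluster ({NODES} nodes): {cluster / 1e6:,.0f} MB/s")
    print(f"per node per day:   {per_node * SECONDS_PER_DAY / 1e9:,.2f} GB")
```

Under these assumptions a single power channel already produces 200 kB/s per node, which grows linearly with node count and with the number of channels monitored. The figures are not measurements from the paper.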
