eACGM: Non-instrumented Performance Tracing and Anomaly Detection towards Machine Learning Systems
Ruilin Xu, Zongxuan Xie, Pengfei Chen

TL;DR
eACGM is a full-stack, non-intrusive monitoring framework for AI/ML systems that uses eBPF and statistical modeling to detect performance anomalies across hardware and software components in real time.
Contribution
It introduces a full-stack, code-instrumentation-free monitoring system that combines hardware and software performance data with GMM-based anomaly detection for AI/ML systems.
Findings
Successfully detects latency anomalies and hardware failures.
Maintains low overhead during distributed training.
Proves effective in real-world production environments.
Abstract
We present eACGM, a full-stack AI/ML system monitoring framework based on eBPF. eACGM collects real-time performance data from key hardware components, including the GPU and network communication layer, as well as from key software stacks such as CUDA, Python, and PyTorch, all without requiring any code instrumentation or modifications. Additionally, it leverages libnvml to gather process-level GPU resource usage information. By applying a Gaussian Mixture Model (GMM) to the collected multidimensional performance metrics for statistical modeling and clustering analysis, eACGM effectively identifies complex failure modes, such as latency anomalies, hardware failures, and communication inefficiencies, enabling rapid diagnosis of system bottlenecks and abnormal behaviors. To evaluate eACGM's effectiveness and practicality, we conducted extensive empirical studies and case analyses in…
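The GMM-based detection step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes a single metric (a synthetic latency in milliseconds) rather than the multidimensional metrics eACGM collects, fits a two-component mixture with plain EM using only the Python standard library, and flags samples whose log-likelihood falls below a low percentile of the training scores. The function names, the synthetic data, and the 1% threshold are all hypothetical choices made for illustration.

```python
import math
import random


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_gmm_1d(samples, k=2, iters=50, min_var=1e-6):
    """Fit a k-component 1-D GMM with plain EM; returns (weights, means, variances)."""
    ordered = sorted(samples)
    n = len(ordered)
    # Deterministic initialisation: place the means at evenly spaced quantiles.
    means = [ordered[int((i + 0.5) * n / k)] for i in range(k)]
    overall_mean = sum(samples) / n
    overall_var = max(sum((x - overall_mean) ** 2 for x in samples) / n, min_var)
    variances = [overall_var] * k
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: responsibility of each component for each sample.
        resp = []
        for x in samples:
            dens = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means, and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp) or 1e-300
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, samples)) / nj
            var_j = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, samples)) / nj
            variances[j] = max(var_j, min_var)
    return weights, means, variances


def log_likelihood(x, model):
    """Log of the mixture density at x, floored to avoid log(0)."""
    weights, means, variances = model
    dens = sum(w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances))
    return math.log(max(dens, 1e-300))


def detect_anomalies(train, test, percentile=1.0):
    """Flag test samples scoring below the given percentile of training log-likelihoods."""
    model = fit_gmm_1d(train)
    scores = sorted(log_likelihood(x, model) for x in train)
    threshold = scores[int(len(scores) * percentile / 100.0)]
    return [x for x in test if log_likelihood(x, model) < threshold]


random.seed(0)
# Synthetic latencies (ms) drawn from two assumed "normal" operating modes.
train = [random.gauss(10.0, 0.5) for _ in range(300)]
train += [random.gauss(25.0, 1.0) for _ in range(300)]
test = [10.2, 24.7, 60.0]  # the last value sits far from both modes
anomalies = detect_anomalies(train, test)
print(anomalies)
```

A mixture model suits this kind of data because a healthy system can have several distinct operating modes; a single-Gaussian threshold would either flag one of the normal modes or miss values that fall between them.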
Taxonomy
Topics: Anomaly Detection Techniques and Applications · Fault Detection and Control Systems
