TL;DR
This paper introduces an observability-aware early-warning framework for GPU failures, many of which show little or no numeric precursor. It uses structural telemetry signals to improve detection and early-warning lead time.
Contribution
It proposes a joint modeling approach combining thermal drift and structural telemetry degradation indicators for early GPU failure detection.
Findings
Detachment failures show minimal numeric precursors.
Structural telemetry collapse (disappearing device metrics, degraded monitoring payloads) is the dominant observable signal for these failures.
Joint modeling improves early-warning lead time.
Abstract
GPU nodes are central to modern HPC and AI workloads, yet many failures do not manifest as immediate hard faults. While some instabilities emerge gradually as weak thermal or efficiency drift, a significant class occurs abruptly with little or no numeric precursor. In these detachment-class failures, GPUs become unavailable at the driver or interconnect level and the dominant observable signal is structural, including disappearance of device metrics and degradation of monitoring payload integrity. This paper proposes an observability-aware early-warning framework that jointly models (i) utilization-aware thermal drift signatures in GPU telemetry and (ii) monitoring-pipeline degradation indicators such as scrape latency increase, sample loss, time-series gaps, and device-metric disappearance. The framework is evaluated on production telemetry from GPU nodes at GWDG, where GPU, node,…
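As a rough illustration of the two signal families the abstract lists, the sketch below computes structural indicators (scrape latency, sample loss, time-series gaps, device-metric disappearance) and a utilization-aware thermal drift value. It is not the paper's implementation: the `Scrape` record layout, the function names, the 1.5x gap threshold, and the linear temperature-versus-utilization baseline are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Scrape:
    """One monitoring scrape of a GPU node (hypothetical record layout)."""
    timestamp: float            # seconds since epoch
    latency_s: Optional[float]  # scrape duration; None if the scrape failed
    gpu_metrics_seen: int       # number of GPU devices that reported metrics


def structural_indicators(scrapes: Sequence[Scrape], interval_s: float,
                          expected_gpus: int) -> dict:
    """Summarize monitoring-pipeline degradation over a window of scrapes."""
    if not scrapes:
        raise ValueError("need at least one scrape")
    ok = [s for s in scrapes if s.latency_s is not None]
    # Sample loss: fraction of scrapes that failed outright.
    sample_loss = 1.0 - len(ok) / len(scrapes)
    # Mean latency of the successful scrapes (0.0 if none succeeded).
    mean_latency = sum(s.latency_s for s in ok) / len(ok) if ok else 0.0
    # Gaps: successful scrapes spaced more than 1.5x the nominal interval
    # apart (the 1.5 factor is an assumed threshold).
    times = sorted(s.timestamp for s in ok)
    gaps = sum(1 for a, b in zip(times, times[1:]) if b - a > 1.5 * interval_s)
    # Device-metric disappearance: worst-case count of expected GPUs that
    # were absent from a successful scrape.
    missing = max((max(0, expected_gpus - s.gpu_metrics_seen) for s in ok),
                  default=expected_gpus)
    return {"sample_loss": sample_loss, "mean_latency_s": mean_latency,
            "gaps": gaps, "max_missing_gpus": missing}


def thermal_drift(util: Sequence[float], temp: Sequence[float],
                  baseline_n: int) -> float:
    """Mean temperature residual after the baseline window, relative to a
    least-squares line temp = a + b * util fitted on the first baseline_n
    samples (an assumed, deliberately simple baseline model)."""
    if len(util) != len(temp) or not 0 < baseline_n < len(util):
        raise ValueError("need matching series and a non-empty recent window")
    bu, bt = util[:baseline_n], temp[:baseline_n]
    mu, mt = sum(bu) / baseline_n, sum(bt) / baseline_n
    var = sum((u - mu) ** 2 for u in bu)
    slope = sum((u - mu) * (t - mt) for u, t in zip(bu, bt)) / var if var else 0.0
    intercept = mt - slope * mu
    residuals = [t - (intercept + slope * u)
                 for u, t in zip(util[baseline_n:], temp[baseline_n:])]
    return sum(residuals) / len(residuals)
```

A joint detector in this spirit could raise an alert when either the drift value or any structural indicator crosses a threshold, so that a detachment with flat temperatures is still caught through missing metrics or scrape gaps. How the paper actually combines the two is not stated in the truncated abstract.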
