When GPUs Fail Quietly: Observability-Aware Early Warning Beyond Numeric Telemetry

Michael Bidollahkhani; Freja Nordsiek; Julian M. Kunkel

arXiv:2603.28781 · cs.DC · April 7, 2026

TL;DR

This paper introduces an observability-aware early-warning framework for GPU failures that often lack numeric precursors, leveraging structural telemetry signals to improve detection and lead time.

Contribution

It proposes a joint modeling approach combining thermal drift and structural telemetry degradation indicators for early GPU failure detection.
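The thermal-drift half of this idea can be illustrated with a minimal sketch. It is not the paper's model: it assumes a simple linear relation between GPU utilization and temperature, fits that relation on a healthy baseline window, and reports how far recent temperatures sit above what utilization alone would predict. The function names `fit_baseline` and `thermal_drift` are hypothetical.

```python
from statistics import mean
from typing import Sequence, Tuple


def fit_baseline(util: Sequence[float], temp: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of temp = slope * util + intercept on a healthy window."""
    if len(util) != len(temp) or len(util) < 2:
        raise ValueError("need two or more paired samples")
    mu, mt = mean(util), mean(temp)
    var = sum((u - mu) ** 2 for u in util)
    if var == 0:
        raise ValueError("utilization is constant; slope is undefined")
    slope = sum((u - mu) * (t - mt) for u, t in zip(util, temp)) / var
    return slope, mt - slope * mu


def thermal_drift(
    util: Sequence[float], temp: Sequence[float], slope: float, intercept: float
) -> float:
    """Mean residual (observed minus predicted temperature) over a recent window.

    A positive value means the GPU runs hotter than its utilization explains.
    """
    return mean(t - (slope * u + intercept) for u, t in zip(util, temp))
```

Because the residual is computed relative to utilization, a busy GPU that is hot for ordinary reasons does not register as drift; only temperature that the load does not account for does.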

Findings

01. Detachment failures show minimal numeric precursors.

02. Structural telemetry collapse is key to detecting failures.

03. Joint modeling improves early-warning lead time.

Abstract

GPU nodes are central to modern HPC and AI workloads, yet many failures do not manifest as immediate hard faults. While some instabilities emerge gradually as weak thermal or efficiency drift, a significant class occurs abruptly with little or no numeric precursor. In these detachment-class failures, GPUs become unavailable at the driver or interconnect level and the dominant observable signal is structural, including disappearance of device metrics and degradation of monitoring payload integrity. This paper proposes an observability-aware early-warning framework that jointly models (i) utilization-aware thermal drift signatures in GPU telemetry and (ii) monitoring-pipeline degradation indicators such as scrape latency increase, sample loss, time-series gaps, and device-metric disappearance. The framework is evaluated on production telemetry from GPU nodes at GWDG, where GPU, node,…
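As a rough illustration of the monitoring-pipeline indicators named in the abstract, the sketch below computes sample loss, the longest time-series gap, and device-metric disappearance for one window of scrapes. It is an assumption-laden example, not the authors' implementation: the `StructuralIndicators` type, the `structural_indicators` function, and the convention of marking an absent device metric as `None` are all hypothetical, and scrape latency is left out for brevity.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class StructuralIndicators:
    sample_loss: float           # fraction of expected scrapes that never arrived
    max_gap_s: float             # longest interval between consecutive scrapes
    device_metric_missing: bool  # True if the latest scrape carried no device metric


def structural_indicators(
    timestamps: Sequence[float],
    device_metric: Sequence[Optional[float]],
    scrape_interval_s: float,
) -> StructuralIndicators:
    """Summarize monitoring-pipeline health over one window.

    timestamps: scrape arrival times in seconds, ascending.
    device_metric: one value per scrape; None means the metric was absent
        from that payload (assumed encoding, for illustration only).
    """
    if len(timestamps) != len(device_metric):
        raise ValueError("timestamps and device_metric must have equal length")
    if len(timestamps) < 2:
        raise ValueError("need two or more scrapes in the window")
    span = timestamps[-1] - timestamps[0]
    expected = round(span / scrape_interval_s) + 1
    sample_loss = max(0.0, 1.0 - len(timestamps) / expected)
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return StructuralIndicators(sample_loss, max(gaps), device_metric[-1] is None)
```

For a 15-second scrape interval, a window with timestamps `[0, 15, 30, 75, 90]` has a 45-second gap and two missing scrapes out of seven expected. None of these quantities depends on the numeric value of any GPU metric, which is why they remain informative when numeric telemetry shows no precursor.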

Code & Models

Repositories

https://doi.org/10.5281/zenodo.19052367