Exploring Silent Data Corruption as a Reliability Challenge in LLM Training

Anton Altenbernd; Philipp Wiesner; Odej Kao

arXiv:2604.00726 · cs.LG · April 2, 2026

TL;DR

This paper investigates how silent data corruption (SDC) during GPU-based training of large language models can cause loss spikes, divergence, or stalled progress, and proposes a lightweight detection method to mitigate these effects.

Contribution

The paper provides a controlled fault injection study of SDC in LLM pretraining, targeting GPU matrix-multiply instructions, and introduces a lightweight detection approach to improve training robustness.

Findings

01

Faults in GPU matrix operations can cause loss spikes and divergence.

02

Recomputing recent training steps can mitigate SDC impacts.

03

Sensitivity to faults varies across bit positions, kernel functions, and execution stages.
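
To make the bit-position finding concrete, the following is a minimal sketch of a single bit flip in an IEEE 754 float32 value. It is not the paper's injection tooling: the helper name `flip_bit` and the choice of float32 (rather than whatever precision the study's kernels use) are assumptions made for illustration only.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Flip one bit of a float32 (hypothetical helper, not from the paper).

    Bit 0 is the least significant mantissa bit, bits 23-30 are the
    exponent, and bit 31 is the sign.
    """
    if not 0 <= bit <= 31:
        raise ValueError("bit must be in [0, 31]")
    # Reinterpret the float32 bytes as an unsigned 32-bit integer.
    (as_int,) = struct.unpack("<I", struct.pack("<f", value))
    # XOR toggles the selected bit, then the bytes are read back as float32.
    (flipped,) = struct.unpack("<f", struct.pack("<I", as_int ^ (1 << bit)))
    return flipped


x = 0.5
low_mantissa = flip_bit(x, 0)    # changes the value by 2**-24
high_exponent = flip_bit(x, 30)  # jumps to 2**127
sign = flip_bit(x, 31)           # becomes -0.5

print(low_mantissa - x, high_exponent, sign)
```

In this toy case a flip in the lowest mantissa bit is indistinguishable from rounding noise, while a flip in the top exponent bit changes the magnitude by dozens of orders of magnitude, which is the kind of asymmetry a per-bit sensitivity study would be expected to measure.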

Abstract

As Large Language Models (LLMs) scale in size and complexity, the consequences of failures during training become increasingly severe. A major challenge arises from Silent Data Corruption (SDC): hardware-induced faults that bypass system-level detection mechanisms. SDC may behave like benign numerical noise, but can also cause harmful gradient corruption that leads to loss spikes, divergence, or stalled progress. This work provides a controlled study of how intermittent SDC affects LLM pretraining. Using targeted fault injection at the level of GPU matrix-multiply instructions, we characterize the sensitivity of different bit positions, kernel functions, and execution stages. Our analysis shows that locally originating faults can produce impactful corruption, including NaN propagation, short-lived spikes in loss, gradient norm, and attention logits, as well as persistent parameter…
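
The idea of detecting a loss spike and recomputing recent steps can be sketched on a toy problem. This is an illustration under stated assumptions, not the authors' detection method, which this summary does not specify: the median-based threshold, the names `is_spike` and `train`, the one-parameter loss `w**2`, and the periodic in-memory snapshot are all hypothetical.

```python
import math
from collections import deque
from statistics import median


def is_spike(loss, history, factor=3.0):
    """Flag a non-finite loss, or one far above the median of recent losses."""
    if not math.isfinite(loss):
        return True
    return len(history) >= 3 and loss > factor * median(history)


def train(steps, fault_step=None, lr=0.1, window=5, snapshot_every=4):
    """Toy gradient descent on loss = w**2 with rollback on a detected spike."""
    w = 10.0
    history = deque(maxlen=window)       # recently accepted losses
    snapshot = (0, w, [])                # last known-good (step, w, history)
    fault_pending = fault_step is not None
    recomputed = 0
    step = 0
    while step < steps:
        grad = 2.0 * w
        if fault_pending and step == fault_step:
            grad *= -1e6                 # simulated one-off corrupted gradient
            fault_pending = False        # intermittent: gone on recomputation
        w_new = w - lr * grad
        loss = w_new ** 2
        if is_spike(loss, history):
            # Discard the suspect update and replay from the last snapshot.
            step, w, saved = snapshot
            history = deque(saved, maxlen=window)
            recomputed += 1
            continue
        w = w_new
        history.append(loss)
        step += 1
        if step % snapshot_every == 0:
            snapshot = (step, w, list(history))
    return w, recomputed


clean_w, clean_rollbacks = train(20)
faulty_w, faulty_rollbacks = train(20, fault_step=7)
print(clean_rollbacks, faulty_rollbacks, clean_w == faulty_w)
```

Because the simulated fault is intermittent, replaying from the snapshot reproduces the fault-free trajectory exactly in this toy setting. A fault that strikes before the detector has enough history, or one too small to exceed the threshold, would go unnoticed by this simple check.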
