Exploring Silent Data Corruption as a Reliability Challenge in LLM Training
Anton Altenbernd, Philipp Wiesner, Odej Kao

TL;DR
This paper investigates how silent data corruption (SDC) during GPU-based training of large language models can cause loss spikes, divergence, or stalled progress, and proposes a lightweight detection method to mitigate these effects.
Contribution
It provides a controlled fault injection study of SDC in LLM training and introduces a lightweight detection approach to improve training robustness.
Findings
Faults in GPU matrix operations can cause loss spikes and divergence.
Recomputing recent training steps can mitigate SDC impacts.
Certain bit positions and kernel functions are more sensitive to faults.
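The recomputation finding above can be illustrated with a toy sketch. This is not the paper's detection method: the scalar model, the `spike_factor` threshold, the injected fault magnitude, and the names `gradient` and `run` are all assumptions made for illustration. It also recomputes only the current step, a simplification of recomputing several recent steps. The idea is that an intermittent fault does not recur on re-execution, so recomputing a step whose gradient looks anomalous recovers the clean trajectory.

```python
def gradient(w: float, x: float, y: float) -> float:
    """Gradient of the loss 0.5 * (w * x - y) ** 2 with respect to w."""
    return (w * x - y) * x


def run(steps: int, fault_at=None, guard: bool = True, spike_factor: float = 10.0):
    """Train a scalar weight with SGD, optionally injecting one corrupted gradient.

    Returns the final weight and the number of recomputed steps.
    All constants here are hypothetical and chosen only for the demo.
    """
    w, x, y, lr = 0.0, 1.0, 2.0, 0.1
    prev_grad = None
    recomputed = 0
    for step in range(steps):
        # Mimic an intermittent fault: a large error added to one gradient.
        fault = 1e6 if step == fault_at else 0.0
        grad = gradient(w, x, y) + fault
        # Hypothetical detector: flag a gradient far larger than the previous one.
        spiked = (
            prev_grad is not None
            and abs(grad) > spike_factor * max(abs(prev_grad), 1e-12)
        )
        if guard and spiked:
            # Recompute the step; the intermittent fault is assumed not to recur.
            grad = gradient(w, x, y)
            recomputed += 1
        w -= lr * grad
        prev_grad = grad
    return w, recomputed


clean_w, _ = run(20)
guarded_w, n_recomputed = run(20, fault_at=5)
unguarded_w, _ = run(20, fault_at=5, guard=False)
print(clean_w, guarded_w, n_recomputed, unguarded_w)
```

In this toy setting the guarded run matches the clean run exactly, while the unguarded run is pushed far off course by a single corrupted step. A real training loop would need a detector that tolerates ordinary gradient noise.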
Abstract
As Large Language Models (LLMs) scale in size and complexity, the consequences of failures during training become increasingly severe. A major challenge arises from Silent Data Corruption (SDC): hardware-induced faults that bypass system-level detection mechanisms. SDC may behave like benign numerical noise, but can also cause harmful gradient corruption that leads to loss spikes, divergence, or stalled progress. This work provides a controlled study of how intermittent SDC affects LLM pretraining. Using targeted fault injection at the level of GPU matrix-multiply instructions, we characterize the sensitivity of different bit positions, kernel functions, and execution stages. Our analysis shows that locally originating faults can produce impactful corruption, including NaN propagation, short-lived spikes in loss, gradient norm, and attention logits, as well as persistent parameter…
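To make the bit-position sensitivity mentioned in the abstract concrete, here is a minimal sketch that flips a single bit in a float32 value. It is an illustration only: the helper `flip_bit` is hypothetical, it operates on a host-side Python float rather than on GPU matrix-multiply instructions, and it does not reproduce the paper's injection tooling. It assumes the IEEE 754 single-precision layout (bit 31 sign, bits 30 to 23 exponent, bits 22 to 0 mantissa).

```python
import math
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value` rounded to float32 with one bit flipped.

    Bit 0 is the least significant mantissa bit; bit 31 is the sign bit.
    """
    if not 0 <= bit < 32:
        raise ValueError("bit must be in [0, 31]")
    # Reinterpret the float32 bytes as an unsigned 32-bit integer.
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    # Flip the requested bit and reinterpret the result as float32 again.
    (flipped,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return flipped


x = 1.0
low_mantissa = flip_bit(x, 0)      # changes the value by 2**-23
sign = flip_bit(x, 31)             # negates the value
top_exponent = flip_bit(x, 30)     # exponent becomes all ones: infinity
nan_value = flip_bit(top_exponent, 22)  # all-ones exponent, nonzero mantissa: NaN

print(low_mantissa - x, sign, top_exponent, math.isnan(nan_value))
```

A flip in a low mantissa bit behaves like benign numerical noise, whereas a flip in the top exponent bit turns 1.0 into infinity, which is the kind of value that can propagate as NaN through later operations.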
