The Anatomy of Silent Data Corruption: GPU Error Pattern Study and Modeling Guidance
Chung-Hsuan Tung, Yanxiang Huang, Nirmal Saxena, Philip Shirvani, Saurabh Hukerikar, Twinkle Jain, Abhishek Tyagi, Sanjay Gongalore

TL;DR
This study investigates silent data corruption in GPUs through large-scale fault injection, revealing key characteristics to improve fault modeling and resilience strategies for GPU-based AI training.
Contribution
It provides detailed empirical data on GPU silent data corruption patterns, informing more accurate high-level fault models and fault injection methods.
Findings
NaN/+INF/-INF account for only 1.01% of SDC outcomes
Single-bit flips are less than 40% of bit-flip events
Corruption addresses exhibit periodicity
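The first two findings depend on comparing a corrupted value with its fault-free ("golden") counterpart at the bit level. The sketch below shows one way such a comparison could be done for 32-bit floats. The function names and the returned dictionary layout are illustrative assumptions, not the authors' tooling.

```python
import math
import struct


def float_to_bits(x: float) -> int:
    """Return the IEEE 754 binary32 bit pattern of x as an unsigned int."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def classify_corruption(golden: float, observed: float) -> dict:
    """Compare a golden float32 value with an observed one (hypothetical helper).

    Reports whether the observed value is NaN, +INF, -INF, an ordinary
    finite value, or an exact match, plus which bits differ.
    """
    diff = float_to_bits(golden) ^ float_to_bits(observed)
    if math.isnan(observed):
        kind = "NaN"
    elif math.isinf(observed):
        kind = "+INF" if observed > 0 else "-INF"
    elif diff == 0:
        kind = "match"
    else:
        kind = "finite"
    positions = [i for i in range(32) if (diff >> i) & 1]
    return {"kind": kind, "flipped_bits": len(positions), "positions": positions}
```

For example, `classify_corruption(1.0, 2.0)` reports a finite corruption with eight differing bits (the exponent field, bits 23 to 30), which illustrates how a numerically modest change can still be a multi-bit event.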
Abstract
Silent data corruption (SDC) threatens the reliability of large-scale GPU clusters used for training large language models, yet its rarity and lack of explicit error signals make accurate high-level modeling challenging. To address this gap, we conducted a large-scale gate-level stuck-at fault injection campaign on a production-class data-center GPU, consuming over three million simulator hours across 63 CUDA micro-benchmarks. We extracted GPU SDC characteristics in terms of corruption types, bit-flip behavior, and warp-aligned spatial correlation. Our results show that NaN/+INF/-INF account for only 1.01% of SDC outcomes, that single-bit flips constitute less than 40% of bit-flip events, and that corruption addresses exhibit periodicity. These statistics motivate distribution-aware high-level fault modeling and realistic software-based fault injection for resilience evaluation of…
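As a rough illustration of what distribution-aware software fault injection might look like, the sketch below corrupts a flat float32 buffer at periodically spaced addresses and draws the number of flipped bits per element from a weighted distribution. Everything specific here is an assumption: the weights in `FLIP_COUNT_WEIGHTS` are placeholders (chosen only so that single-bit flips stay under 40%, consistent with the finding above), the stride of 32 in the usage note is picked to mirror the NVIDIA warp size, and the function names are hypothetical. None of it reproduces the paper's measured distributions or method.

```python
import random
import struct

# HYPOTHETICAL placeholder weights for how many bits flip per corrupted value.
# Not the paper's data; only the "single-bit < 40%" constraint is taken from it.
FLIP_COUNT_WEIGHTS = {1: 0.35, 2: 0.25, 3: 0.20, 4: 0.20}


def flip_bits(value: float, positions) -> float:
    """Flip the given bit positions (0 = LSB) in the binary32 encoding of value."""
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    for p in positions:
        bits ^= 1 << p
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def inject_periodic(data, start, stride, count, rng):
    """Corrupt up to `count` elements at start, start + stride, ... (a sketch).

    Returns a corrupted copy of `data` and the list of indices that were hit.
    """
    out = list(data)
    flip_counts = list(FLIP_COUNT_WEIGHTS)
    weights = list(FLIP_COUNT_WEIGHTS.values())
    hit = []
    for k in range(count):
        idx = start + k * stride
        if idx >= len(out):
            break
        n = rng.choices(flip_counts, weights=weights)[0]
        positions = rng.sample(range(32), n)  # n distinct bit positions
        out[idx] = flip_bits(out[idx], positions)
        hit.append(idx)
    return out, hit
```

A call such as `inject_periodic([1.0] * 128, start=3, stride=32, count=4, rng=random.Random(0))` corrupts indices 3, 35, 67, and 99 and leaves every other element untouched. Passing a seeded `random.Random` keeps an injection run reproducible.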
