The Anatomy of Silent Data Corruption: GPU Error Pattern Study and Modeling Guidance

Chung-Hsuan Tung; Yanxiang Huang; Nirmal Saxena; Philip Shirvani; Saurabh Hukerikar; Twinkle Jain; Abhishek Tyagi; Sanjay Gongalore

arXiv:2605.04213·cs.AR·May 7, 2026

TL;DR

This study characterizes silent data corruption (SDC) in GPUs through large-scale gate-level stuck-at fault injection across 63 CUDA micro-benchmarks. It reports corruption types, bit-flip behavior, and spatial correlation to guide fault modeling and resilience evaluation for GPU-based AI training.

Contribution

The paper provides detailed empirical data on GPU silent data corruption patterns, which can inform more accurate high-level fault models and more realistic software-based fault injection.

Findings

01 NaN/+INF/-INF account for only 1.01% of SDC outcomes

02 Single-bit flips are less than 40% of bit-flip events

03 Corruption addresses exhibit periodicity
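As a rough illustration of what checking for address periodicity could look like, the sketch below finds the most common gap between consecutive corrupted element offsets. It is not the authors' analysis: the function name `dominant_stride`, the input format (a flat list of integer offsets), and the example stride of 32 are all assumptions made for illustration.

```python
from collections import Counter


def dominant_stride(addresses):
    """Return (stride, fraction) for the most common gap between
    consecutive distinct corrupted offsets, or None if fewer than two.

    Hypothetical helper: the input is assumed to be a list of integer
    element offsets at which an output differed from the golden run.
    """
    ordered = sorted(set(addresses))
    gaps = Counter(b - a for a, b in zip(ordered, ordered[1:]))
    if not gaps:
        return None
    stride, count = gaps.most_common(1)[0]
    # Fraction of all consecutive gaps that equal the dominant stride.
    return stride, count / (len(ordered) - 1)


# Made-up offsets: mostly spaced by 32, plus one stray corruption.
example = [0, 32, 64, 96, 100]
result = dominant_stride(example)
```

A high fraction for a single stride would hint at periodic structure, while a flat distribution of gaps would not. A real study would likely need a more robust test (for example, a histogram of offsets modulo candidate periods), since a single stray address splits one gap into two.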

Abstract

Silent data corruption (SDC) threatens the reliability of large-scale GPU clusters used for training large language models, yet its rarity and lack of explicit error signals make accurate high-level modeling challenging. To address this gap, we conducted a large-scale gate-level stuck-at fault injection on a production-class data-center GPU, consuming over three million simulator hours across 63 CUDA micro-benchmarks. We extracted GPU SDC characteristics in terms of corruption types, bit-flip behavior, and warp-aligned spatial correlation. Our results show that NaN/+INF/-INF account for only 1.01% of SDC outcomes, that single-bit flips constitute less than 40% of bit-flip events, and that corruption addresses exhibit periodicity. These statistics motivate distribution-aware high-level fault modeling and realistic software-based fault injection for resilience evaluation of…
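To make the corruption-type and bit-flip statistics concrete, here is a minimal sketch of how a single corrupted value could be classified against its golden (fault-free) counterpart. It assumes 32-bit IEEE 754 floats and compares one value at a time; the names `float_bits` and `classify_corruption` are hypothetical, and this is not the paper's tooling.

```python
import math
import struct


def float_bits(x: float) -> int:
    """Return the IEEE 754 binary32 bit pattern of x as an unsigned int."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def classify_corruption(golden: float, faulty: float) -> dict:
    """Classify one corrupted value relative to its golden value.

    Assumption: both values fit in binary32. The categories mirror the
    ones named in the abstract (NaN, +INF, -INF) plus a catch-all
    'finite' bucket introduced here for illustration.
    """
    # XOR of the two bit patterns marks every bit position that differs.
    flipped = bin(float_bits(golden) ^ float_bits(faulty)).count("1")
    if math.isnan(faulty):
        kind = "NaN"
    elif math.isinf(faulty):
        kind = "+INF" if faulty > 0 else "-INF"
    else:
        kind = "finite"
    return {"kind": kind, "flipped_bits": flipped, "single_bit": flipped == 1}


sign_flip = classify_corruption(1.0, -1.0)          # only the sign bit differs
exponent_change = classify_corruption(1.0, 2.0)     # several exponent bits differ
to_infinity = classify_corruption(1.0, float("inf"))
```

Aggregating these records over many injected faults would give the kind of distribution the abstract describes, such as the share of NaN/INF outcomes or the share of single-bit flips. Note that a single flipped exponent bit can turn 1.0 into +INF, so the two statistics are not independent.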
