Silent Data Corruption by 10x Test Escapes Threatens Reliable Computing
Subhasish Mitra, Subho Banerjee, Martin Dixon, Rama Govindaraju, Peter Hochschild, Eric X. Liu, Bharath Parthasarathy, Parthasarathy Ranganathan

TL;DR
Defective compute chips that escape manufacturing tests cause silent data corruptions (SDCs) in data centers. The paper outlines a three-pronged approach to the problem: quick diagnosis from system-level incorrect behaviors, in-field detection, and new test experiments.
Contribution
It outlines three future directions for overcoming test escapes: diagnosing defective chips directly from system-level incorrect behaviors, detecting defective chips in the field, and running new test experiments that measure how effective new detection techniques are.
Findings
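Test escapes exceed industrial targets by at least an order of magnitude across all compute chip types in data centers
Quick diagnosis directly from system-level incorrect behaviors is needed to explain why so many defective chips escape manufacturing testing
New test experiments must avoid the drawbacks and pitfalls of previous industrial test experiments and case studies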
Abstract
Too many defective compute chips are escaping existing manufacturing tests -- at least an order of magnitude more than industrial targets across all compute chip types in data centers. Silent data corruptions (SDCs) caused by test escapes, when left unaddressed, pose a major threat to reliable computing. We present a three-pronged approach outlining future directions for overcoming test escapes: (a) Quick diagnosis of defective chips directly from system-level incorrect behaviors. Such diagnosis is critical for gaining insights into why so many defective chips escape existing manufacturing testing. (b) In-field detection of defective chips. (c) New test experiments to understand the effectiveness of new techniques for detecting defective chips. These experiments must overcome the drawbacks and pitfalls of previous industrial test experiments and case studies.
Taxonomy
Topics: Big Data and Digital Economy · Machine Learning and Data Classification
