Silent Data Corruptions at Scale
Harish Dattatraya Dixit, Sneha Pendharkar, Matt Beadon, Chris Mason, Tejasvi Chakravarthy, Bharath Muthiah, Sriram Sankar

TL;DR
Silent Data Corruptions (SDCs) pose significant risks to large-scale infrastructure, propagating unnoticed and causing data loss, but can be mitigated through hardware and software resilience strategies informed by extensive real-world testing.
Contribution
This paper provides a comprehensive analysis of SDCs, including defect types, a real-world case study, debugging methodologies, and mitigation strategies based on large-scale testing across data center infrastructure.
Findings
SDCs are systemic across CPU generations.
Hundreds of CPUs were affected by silent errors in a large-scale deployment.
Long-term monitoring reveals the importance of combined hardware and software resilience.
Abstract
Silent Data Corruption (SDC) can have a negative impact on large-scale infrastructure services. SDCs are not captured by error reporting mechanisms within a Central Processing Unit (CPU) and hence are not traceable at the hardware level. However, the data corruptions propagate across the stack and manifest as application-level problems. These types of errors can result in data loss and can require months of debug engineering time. In this paper, we describe common defect types observed in silicon manufacturing that lead to SDCs. We discuss a real-world example of silent data corruption within a datacenter application. We provide the debug flow followed to root-cause and triage faulty instructions within a CPU using a case study, as an illustration of how to debug this class of errors. We provide a high-level overview of the mitigations to reduce the risk of silent data corruptions within…
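Because SDCs raise no hardware error, one way to surface them in software is a known-answer check: run a deterministic computation and compare it against an independently derived reference. The Python sketch below is only an illustration of that idea and is not the authors' debug flow or tooling. The function names, the sum-of-squares workload, and the simulated fault at input 53 are all hypothetical choices made for this example.

```python
from typing import Callable, Iterable, List, Tuple


def sum_of_squares_loop(n: int) -> int:
    """Workload under test: add up i*i for i = 1..n with a plain loop."""
    total = 0
    for i in range(1, n + 1):
        total += i * i
    return total


def sum_of_squares_closed_form(n: int) -> int:
    """Independent reference: closed form n(n+1)(2n+1)/6, exact in integers."""
    return n * (n + 1) * (2 * n + 1) // 6


def find_mismatches(
    compute: Callable[[int], int],
    reference: Callable[[int], int],
    inputs: Iterable[int],
) -> List[Tuple[int, int, int]]:
    """Return (input, expected, actual) for every input where the two disagree."""
    mismatches = []
    for value in inputs:
        actual = compute(value)
        expected = reference(value)
        if actual != expected:
            mismatches.append((value, expected, actual))
    return mismatches


def simulated_faulty_sum(n: int) -> int:
    """Hypothetical stand-in for a defective core: silently returns 0 for one input."""
    return 0 if n == 53 else sum_of_squares_loop(n)


if __name__ == "__main__":
    # A healthy computation produces no mismatches.
    print(find_mismatches(sum_of_squares_loop, sum_of_squares_closed_form, range(100)))
    # The simulated fault is reported with its input, expected value, and actual value.
    print(find_mismatches(simulated_faulty_sum, sum_of_squares_closed_form, range(100)))
```

A check like this only helps when the reference is trustworthy. If both computations run on the same defective core, they could in principle be corrupted together, so a real deployment would likely compare against precomputed values or results from a different machine.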
Taxonomy
Topics: Radiation Effects in Electronics · VLSI and Analog Circuit Testing · Distributed systems and fault tolerance
