Empirical Measurements of Disk Failure Rates and Error Rates
Jim Gray, Catharine van Ingen

TL;DR
This paper empirically measures disk failure and error rates in real-world scenarios, finding that SATA uncorrectable read errors are rare and that MTTDL is a more relevant metric for data integrity than UER.
Contribution
It provides large-scale empirical data on disk error rates and challenges the adequacy of UER as a reliability metric, advocating for MTTDL as more meaningful.
Findings
SATA uncorrectable read errors are infrequent compared to other failures.
System faults were more often caused by controller failures and by reboots for security patches than by disk read errors.
A single uncorrectable read error event typically involves several damaged storage blocks, and some errors may be masked by the operating system.
Abstract
The SATA advertised bit error rate of one error in 10 terabytes is frightening. We moved 2 PB through low-cost hardware and saw five disk read error events, several controller failures, and many system reboots caused by security patches. We conclude that SATA uncorrectable read errors are not yet a dominant system-fault source - they happen, but are rare compared to other problems. We also conclude that UER (uncorrectable error rate) is not the relevant metric for our needs. When an uncorrectable read error happens, there are typically several damaged storage blocks (and many uncorrectable read errors.) Also, some uncorrectable read errors may be masked by the operating system. The more meaningful metric for data architects is Mean Time To Data Loss (MTTDL.)
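To put the abstract's numbers side by side, the sketch below works out how many errors the advertised rate would predict for 2 PB of traffic and compares that with the five observed events. It makes several assumptions that are not in the paper: decimal units (1 TB = 10^12 bytes), all 2 PB counted as data read, and errors treated as independent Poisson arrivals. The function names are illustrative only. The paper counts error events, each of which may cover many individual read errors, so this is a rough order-of-magnitude comparison rather than a reproduction of the authors' analysis.

```python
import math

TB = 10**12  # assumed decimal terabyte
PB = 10**15  # assumed decimal petabyte


def expected_errors(bytes_moved, bytes_per_error):
    """Expected error count if one error occurs per `bytes_per_error` bytes."""
    return bytes_moved / bytes_per_error


def prob_at_most(k, mean):
    """P(X <= k) for a Poisson variable with the given mean (modeling assumption)."""
    return sum(math.exp(-mean) * mean**i / math.factorial(i) for i in range(k + 1))


# Advertised rate from the abstract: one error per 10 TB; workload: 2 PB.
expected = expected_errors(2 * PB, 10 * TB)

# Chance of seeing five or fewer errors if the advertised rate really applied.
p_observed = prob_at_most(5, expected)

print(f"expected errors at advertised rate: {expected:.0f}")
print(f"P(at most 5 errors | advertised rate): {p_observed:.2e}")
```

Under these assumptions the advertised rate predicts about 200 errors, so observing only five events is consistent with the abstract's conclusion that uncorrectable read errors were much rarer in practice than the headline figure suggests.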
Taxonomy
Topics: Advanced Data Storage Technologies · Distributed systems and fault tolerance · Parallel Computing and Optimization Techniques
