Empirical Measurements of Disk Failure Rates and Error Rates

Jim Gray; Catharine van Ingen

arXiv:cs/0701166 · cs.DB · May 23, 2007 · 65 cites

Open Access

TL;DR

This paper empirically measures disk failure and error rates in real-world scenarios, finding that SATA uncorrectable read errors are rare and that MTTDL is a more relevant metric for data integrity than UER.

Contribution

It provides large-scale empirical data on disk error rates and challenges the adequacy of UER as a reliability metric, advocating for MTTDL as more meaningful.

Findings

01

SATA uncorrectable read errors are infrequent compared to other failures.

02

System failures are often caused by controller issues and system reboots, not just disk errors.

03

Uncorrectable errors often involve multiple damaged blocks and can be masked by the operating system.

Abstract

The SATA advertised bit error rate of one error in 10 terabytes is frightening. We moved 2 PB through low-cost hardware and saw five disk read error events, several controller failures, and many system reboots caused by security patches. We conclude that SATA uncorrectable read errors are not yet a dominant system-fault source; they happen, but are rare compared to other problems. We also conclude that UER (uncorrectable error rate) is not the relevant metric for our needs. When an uncorrectable read error happens, there are typically several damaged storage blocks (and many uncorrectable read errors). Also, some uncorrectable read errors may be masked by the operating system. The more meaningful metric for data architects is Mean Time To Data Loss (MTTDL).
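
To put the abstract's numbers side by side, the short sketch below works out how many errors the advertised rate would predict for the volume of data moved and compares that with the five observed events. It is only a back-of-the-envelope illustration, not the authors' analysis. It assumes decimal units (1 PB = 1000 TB), treats all 2 PB as read traffic, and counts each observed "event" as one error, even though the abstract notes that a single event can involve many uncorrectable reads. The function and constant names are hypothetical.

```python
# Figures quoted in the abstract.
ADVERTISED_TB_PER_ERROR = 10   # "one error in 10 terabytes"
DATA_MOVED_PB = 2              # "We moved 2 PB"
OBSERVED_EVENTS = 5            # "five disk read error events"

# Assumption (not stated in the source): decimal units, 1 PB = 1000 TB.
TB_PER_PB = 1000


def expected_errors(data_moved_pb: float, tb_per_error: float) -> float:
    """Errors predicted if one error occurs per `tb_per_error` terabytes moved."""
    return data_moved_pb * TB_PER_PB / tb_per_error


expected = expected_errors(DATA_MOVED_PB, ADVERTISED_TB_PER_ERROR)
ratio = expected / OBSERVED_EVENTS

print(f"Expected at advertised rate: {expected:.0f} errors")
print(f"Observed read error events:  {OBSERVED_EVENTS}")
print(f"Advertised rate predicts about {ratio:.0f}x more than observed")
```

Under these assumptions the advertised rate predicts roughly 200 errors against 5 observed events, which is consistent with the abstract's conclusion that uncorrectable read errors were rare relative to other faults.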

Taxonomy

Topics: Advanced Data Storage Technologies · Distributed systems and fault tolerance · Parallel Computing and Optimization Techniques