Live Recovery of Bit Corruptions in Datacenter Storage Systems
Amy Tai, Andrew Kryczka, Shobhit Kanaujia, Chris Petersen, Mikhail Antonov, Muhammad Waliji, Kyle Jamieson, Michael J. Freedman, Asaf Cidon

TL;DR
This paper introduces DIRECT, a set of policies that leverage redundancy in distributed storage to recover from flash memory bit errors, significantly extending device lifetime and improving error tolerance in datacenter storage systems.
Contribution
The paper presents DIRECT, a novel approach that uses latent redundancy to recover from bit errors, enabling higher error rates and longer flash device lifetimes in real-world systems.
Findings
Reduces application-visible error rates in ZippyDB by over 100 times
Decreases recovery time by more than 10,000 times
Enables HDFS to tolerate 10,000 to 100,000 times higher bit error rates
Abstract
Due to its high performance and decreasing cost per bit, flash is becoming the main storage medium in datacenters for hot data. However, flash endurance is a perpetual problem, and due to technology trends, subsequent generations of flash devices exhibit progressively shorter lifetimes before they experience uncorrectable bit errors. In this paper we propose extending flash lifetime by allowing devices to expose higher bit error rates. To do so, we present DIRECT, a novel set of policies that leverages latent redundancy in distributed storage systems to recover from bit corruption errors with minimal performance and recovery overhead. In doing so, DIRECT can significantly extend the lifetime of flash devices by effectively utilizing these devices even after they begin exposing bit errors. We implemented DIRECT on two real-world storage systems: ZippyDB, a distributed key-value store…
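The core idea in the abstract, repairing a corrupted piece of data from redundancy that already exists elsewhere in the distributed system, can be illustrated with a minimal sketch. This is not the paper's implementation: the `Replica` class, the tiny `BLOCK_SIZE`, the use of CRC32 checksums, and the `read_with_recovery` helper are all hypothetical names and choices made for illustration. The sketch assumes data is stored in checksummed blocks and that a peer replica holds an identical copy.

```python
import zlib
from typing import List, Tuple

BLOCK_SIZE = 4  # hypothetical, unrealistically small block size for illustration


class Replica:
    """Hypothetical in-memory replica: data split into CRC32-checksummed blocks."""

    def __init__(self, data: bytes):
        self.blocks: List[bytes] = [
            data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)
        ]
        # Checksums are computed once at write time and kept alongside the blocks.
        self.checksums: List[int] = [zlib.crc32(b) for b in self.blocks]

    def block_ok(self, i: int) -> bool:
        """Return True if block i still matches its stored checksum."""
        return zlib.crc32(self.blocks[i]) == self.checksums[i]


def read_with_recovery(local: Replica, peers: List[Replica]) -> Tuple[bytes, List[int]]:
    """Read all data from `local`, repairing only the corrupted blocks from peers.

    Returns the recovered data and the indices of the blocks that were repaired.
    """
    repaired: List[int] = []
    for i in range(len(local.blocks)):
        if local.block_ok(i):
            continue
        for peer in peers:
            # Accept a peer block only if it is internally consistent and
            # matches the checksum the local replica recorded for this block.
            if peer.block_ok(i) and zlib.crc32(peer.blocks[i]) == local.checksums[i]:
                local.blocks[i] = peer.blocks[i]
                repaired.append(i)
                break
        else:
            raise IOError(f"block {i} is corrupted and no peer holds a valid copy")
    return b"".join(local.blocks), repaired


# Usage: flip one bit in one local block, then recover it from a peer.
payload = b"hot data stored on flash"
local, peer = Replica(payload), Replica(payload)
damaged = bytearray(local.blocks[2])
damaged[0] ^= 0x01  # simulate a single-bit corruption
local.blocks[2] = bytes(damaged)

data, repaired = read_with_recovery(local, [peer])
```

In this sketch only the single damaged block is fetched from the peer, rather than the whole replica being rebuilt. Repairing at a fine granularity is one plausible way to keep recovery overhead low, but the actual policies DIRECT uses are described in the paper itself.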
Taxonomy
Topics: Advanced Data Storage Technologies · Caching and Content Delivery · Cloud Computing and Resource Management
