XORing Elephants: Novel Erasure Codes for Big Data
Maheswaran Sathiamoorthy, Megasthenis Asteris, Dimitris Papailiopoulos, Alexandros G. Dimakis, Ramkumar Vadali, Scott Chen, Dhruba Borthakur

TL;DR
This paper introduces a new family of erasure codes that significantly reduce repair bandwidth and repair time in distributed storage systems. The codes offer higher reliability than Reed-Solomon codes at the cost of a modest increase in storage overhead.
Contribution
The paper presents a novel family of erasure codes that is optimal on the tradeoff between locality and minimum distance and is cheaper to repair than Reed-Solomon codes, improving reliability and reducing repair costs in large-scale storage systems.
Findings
Approximately 2x reduction in repair disk I/O
Approximately 2x reduction in repair network traffic
14% more storage than Reed-Solomon codes, an overhead that is information-theoretically optimal for obtaining locality
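The arithmetic behind these figures can be illustrated with a minimal sketch. It assumes a (10, 4) Reed-Solomon baseline (10 data blocks, 4 parity blocks) extended with two stored local parity blocks, one per group of five data blocks; these parameters are not stated in this summary and are used here only for illustration. The local parity is simplified to a plain XOR, which is not necessarily the paper's exact construction, and names such as `xor_blocks` are hypothetical.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Assumed layout: 10 data blocks, split into two local groups of 5.
data = [bytes([i] * 4) for i in range(1, 11)]
group = data[:5]
local_parity = xor_blocks(group)  # simplified local parity for the first group

# Lose one block, then rebuild it from the rest of its group plus the local parity.
lost_index = 2
survivors = [block for i, block in enumerate(group) if i != lost_index]
repaired = xor_blocks(survivors + [local_parity])

blocks_read_local = len(survivors) + 1  # 4 data blocks + 1 local parity
blocks_read_rs = 10                     # assumed RS repair reads 10 blocks

# Storage comparison under the assumed parameters.
stored_rs = 10 + 4       # data + RS parities
stored_local = 10 + 4 + 2  # plus two local parities
extra_storage = stored_local / stored_rs - 1

print(repaired == data[lost_index])
print(blocks_read_rs / blocks_read_local)
print(f"{extra_storage:.1%}")
```

Under these assumptions a single-block repair reads 5 blocks instead of 10, and the two extra parity blocks raise the stored block count from 14 to 16, which matches the roughly 2x and 14% figures listed above.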
Abstract
Distributed storage systems for large clusters typically use replication to provide reliability. Recently, erasure codes have been used to reduce the large storage overhead of three-replicated systems. Reed-Solomon codes are the standard design choice and their high repair cost is often considered an unavoidable price to pay for high storage efficiency and high reliability. This paper shows how to overcome this limitation. We present a novel family of erasure codes that are efficiently repairable and offer higher reliability compared to Reed-Solomon codes. We show analytically that our codes are optimal on a recently identified tradeoff between locality and minimum distance. We implement our new codes in Hadoop HDFS and compare to a currently deployed HDFS module that uses Reed-Solomon codes. Our modified HDFS implementation shows a reduction of approximately 2x on the repair disk…
Taxonomy
Topics: Advanced Data Storage Technologies · Caching and Content Delivery · Distributed systems and fault tolerance
