Exploring Erasure Coding Techniques for High Availability of Intermediate Data
Zhe Zhang, Brian Bockelman, Derek Weitzel, David Swanson

TL;DR
This paper investigates the use of erasure coding techniques for storing intermediate data in scientific workflows, aiming to improve storage efficiency and data availability compared to traditional replication methods.
Contribution
It introduces algorithms for proactive data redundancy relocation and physical placement to enhance data durability and reduce network bandwidth during data reconstruction.
Findings
Erasure codes reduce storage requirements compared to replication.
Proactive redundancy relocation improves data availability.
Physical placement algorithms decrease network bandwidth for data reconstruction.
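The storage and availability trade-off behind these findings can be illustrated with a short calculation. The following is a minimal sketch, not the paper's evaluation method: it assumes an (n, k) erasure code in which any k of n fragments suffice to rebuild the data, and it assumes machines fail independently, each being available with probability p. The function names are illustrative.

```python
from math import comb


def replication_overhead(r: int) -> float:
    """Bytes stored per byte of data when keeping r full copies."""
    return float(r)


def erasure_overhead(n: int, k: int) -> float:
    """Bytes stored per byte of data for an (n, k) code:
    the data is split into k fragments and encoded into n."""
    return n / k


def replication_availability(p: float, r: int) -> float:
    """Probability that at least one of r replicas is reachable,
    assuming independent machines, each available with probability p."""
    return 1 - (1 - p) ** r


def erasure_availability(p: float, n: int, k: int) -> float:
    """Probability that at least k of n fragments are reachable,
    under the same independence assumption."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))
```

For instance, `erasure_overhead(6, 4)` is 1.5, against 3.0 for three-way replication. Replication is the special case k = 1, so `erasure_availability(p, r, 1)` agrees with `replication_availability(p, r)`.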
Abstract
Scientific computing workflows generate enormous amounts of distributed data that is short-lived, yet critical for job completion time. This class of data is called intermediate data. A common way to achieve high data availability is to replicate data. However, the increasing scale of intermediate data generated in modern scientific applications demands new storage techniques to improve storage efficiency. Erasure codes, as an alternative, can use less storage space while maintaining similar data availability. In this paper, we adopt erasure codes for storing intermediate data and compare their performance with that of replication. We also use the metric of Mean-Time-To-Data-Loss (MTTDL) to estimate the lifetime of intermediate data. We propose an algorithm to proactively relocate data redundancy from vulnerable machines to reliable ones to improve data availability with some extra network overhead.…
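The idea of moving redundancy from vulnerable machines to reliable ones can be sketched as a simple greedy pass. This is a hypothetical illustration only, not the algorithm proposed in the paper: the per-machine failure estimates, the `threshold` parameter, and the function `relocate` are all assumptions introduced here, and the sketch ignores machine capacity and load balancing.

```python
def relocate(placement, fail_prob, threshold):
    """Hypothetical greedy relocation of erasure-coded fragments.

    placement: dict mapping stripe id -> list of machines, one fragment each.
    fail_prob: dict mapping machine -> estimated failure probability.
    threshold: machines with fail_prob above this count as vulnerable.

    Returns (new_placement, moves); each move is (stripe, source, target)
    and represents one fragment transferred over the network.
    """
    # Candidate targets, ordered from most to least reliable.
    reliable = sorted(
        (m for m in fail_prob if fail_prob[m] <= threshold),
        key=fail_prob.get,
    )
    new_placement = {}
    moves = []
    for stripe, machines in placement.items():
        current = list(machines)
        for idx, machine in enumerate(machines):
            if fail_prob[machine] <= threshold:
                continue  # fragment already sits on a reliable machine
            # Pick the most reliable machine that holds no fragment of this
            # stripe, so that one failure never removes two fragments.
            target = next((c for c in reliable if c not in current), None)
            if target is not None:
                current[idx] = target
                moves.append((stripe, machine, target))
        new_placement[stripe] = current
    return new_placement, moves
```

The length of `moves` gives a rough count of the extra network transfers that relocation costs, which is the overhead the abstract refers to.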
Taxonomy
Topics: Advanced Data Storage Technologies · Distributed systems and fault tolerance · Distributed and Parallel Computing Systems
