A Reliability Model for Dependent and Distributed MDS Disk Array Units
Suayb S. Arslan

TL;DR
This paper develops a Markov chain-based failure model for large-scale distributed storage systems using MDS erasure codes, addressing correlated failures and reliability in multi-parity disk arrays.
Contribution
It introduces a novel failure model that captures correlated failures and system lifecycle patterns, improving reliability analysis of distributed MDS-coded storage systems.
Findings
Correlated failures significantly impact system reliability.
Adding more parity disks is beneficial only with sufficient failure domain decorrelation.
The model accurately predicts failure behaviors in large distributed storage environments.
Abstract
Archiving and systematic backup of large volumes of digital data generate a rapidly growing demand for multi-petabyte-scale storage systems. As drive capacities continue to grow beyond the few-terabyte range to address the demands of today's cloud, multiple simultaneous disk failures become a reality. Among the main factors causing catastrophic system failures, correlated disk failures and network bandwidth are reported to be the two common sources of performance degradation. The emerging trend is to use efficient, sophisticated erasure codes (EC) equipped with multiple parities and efficient repairs in order to meet the reliability and bandwidth requirements. It is known that the mean time to failure and repair rates reported by disk manufacturers cannot capture the life cycle patterns of distributed storage systems. In this study, we develop failure models based on generalized Markov…
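For orientation, the kind of baseline that such Markov models generalize can be sketched in a few lines. The example below is not the paper's model: it is the textbook birth-death chain for an array of n disks that tolerates m failures, and it assumes independent, exponentially distributed disk failures (rate lam per disk) and a single repair process (rate mu), which are exactly the assumptions the paper argues are too optimistic when failures are correlated. The function name mttdl and all parameter names and values are hypothetical, chosen only for illustration.

```python
def mttdl(n, m, lam, mu):
    """Mean time to data loss for a textbook birth-death Markov chain.

    Assumptions (illustrative, not taken from the paper):
      n   : total number of disks in the MDS array
      m   : number of disk failures the code tolerates (parity count)
      lam : per-disk failure rate, failures assumed independent
      mu  : repair rate, one repair in progress at a time
    State i means i disks have failed; state m + 1 is data loss.
    """
    total = 0.0  # accumulated expected time to reach state m + 1
    prev = 0.0   # expected first-passage time from state i - 1 to state i
    for i in range(m + 1):
        fail = (n - i) * lam            # rate of moving from i to i + 1
        repair = mu if i > 0 else 0.0   # rate of moving from i to i - 1
        # First-passage recursion for a birth-death chain:
        # E_i = 1 / fail + (repair / fail) * E_{i-1}
        step = 1.0 / fail + (repair / fail) * prev
        total += step
        prev = step
    return total


if __name__ == "__main__":
    # Hypothetical rates in events per hour, for illustration only.
    lam, mu = 1e-5, 0.1
    for parities in (1, 2, 3):
        print(parities, mttdl(12, parities, lam, mu))
```

Under these independence assumptions every extra parity raises the computed mean time to data loss by several orders of magnitude. The finding reported above, that extra parities help only when failure domains are sufficiently decorrelated, is a statement about how this idealized picture changes once correlation is modeled.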
Taxonomy
Topics: Advanced Data Storage Technologies · Distributed systems and fault tolerance · Distributed and Parallel Computing Systems
