Scalable Reliability Modelling of RAID Storage Subsystems
Prasenjit Karmakar, K. Gopinath

TL;DR
This paper introduces a scalable continuous-time Markov chain (CTMC) reliability model for RAID storage systems that models the system's components and their failure modes in detail, enabling efficient analysis of large configurations with up to 600 disks.
Contribution
The authors develop a scalable CTMC model for RAID systems that incorporates detailed component failure characteristics and uses state-space reduction techniques for practical computation.
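As a rough illustration of what a CTMC reliability model computes, the sketch below builds the smallest useful case. It models a single RAID-5 group of N disks with a constant (exponential) disk failure rate λ and a rebuild rate μ, and it solves for the mean time to data loss (MTTDL) as a mean time to absorption. The state layout, the function names and the parameter values are assumptions chosen for illustration. This is not the authors' model, which also covers controllers, enclosures, expanders and interconnects.

```python
from typing import List


def solve_linear(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b] so row operations update both sides together.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        # Swap in the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def mean_time_to_absorption(q_transient: List[List[float]]) -> List[float]:
    """Expected time to absorption from each transient state of a CTMC.

    q_transient is the generator restricted to the transient states; its
    diagonal holds minus the total outflow, including flow into absorbing
    states. The expected times m solve (-Q_T) m = 1.
    """
    neg_q = [[-v for v in row] for row in q_transient]
    return solve_linear(neg_q, [1.0] * len(q_transient))


def raid5_mttdl(n_disks: int, lam: float, mu: float) -> float:
    """MTTDL of one RAID-5 group (hypothetical minimal model).

    State 0: all disks up.  State 1: one disk failed, rebuild running.
    0 -> 1 at n*lam; 1 -> 0 at mu (rebuild done);
    1 -> data loss at (n-1)*lam (second failure during rebuild).
    """
    q = [
        [-n_disks * lam, n_disks * lam],
        [mu, -(mu + (n_disks - 1) * lam)],
    ]
    return mean_time_to_absorption(q)[0]


if __name__ == "__main__":
    # Illustrative values only: 8 disks, 1,000,000 h disk MTTF, 24 h rebuild.
    hours = raid5_mttdl(8, 1 / 1_000_000, 1 / 24)
    print(f"MTTDL ~ {hours:.3e} hours")
```

For this two-state case the solver should agree with the closed form MTTDL = (μ + (2N−1)λ) / (N(N−1)λ²). The generic solver matters once more components and failure modes are added, because each addition multiplies the number of transient states.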
Findings
Model scales to systems with up to 600 disks.
Models disk failures with Weibull distributions and correlated failure modes.
More practical than Monte Carlo simulations.
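A plain CTMC assumes exponential holding times, that is, a constant failure (hazard) rate. A Weibull lifetime with shape k ≠ 1 instead has a hazard that changes with age. This summary does not say how the authors fit Weibull disk failures into their CTMC. The sketch below only shows the difference, using the standard Weibull formulas. The shape and scale values are hypothetical.

```python
import math


def weibull_hazard(t: float, shape: float, scale: float) -> float:
    """Weibull hazard h(t) = (k/s) * (t/s)**(k-1); requires t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def weibull_reliability(t: float, shape: float, scale: float) -> float:
    """Weibull survival function R(t) = exp(-(t/s)**k)."""
    return math.exp(-((t / scale) ** shape))


def exponential_reliability(t: float, rate: float) -> float:
    """Survival function of the constant-rate lifetime a CTMC assumes."""
    return math.exp(-rate * t)


if __name__ == "__main__":
    # Hypothetical disk parameters: scale 1,000,000 h, shape 1.2 (wear-out).
    scale, shape = 1_000_000.0, 1.2
    for t in (1_000.0, 10_000.0, 50_000.0):
        print(
            f"t={t:>8.0f} h  Weibull h(t)={weibull_hazard(t, shape, scale):.3e}"
            f"  exponential rate={1 / scale:.3e}"
        )
```

With shape 1.0 the Weibull and exponential curves coincide, and that is the only case a CTMC represents exactly. With shape above 1 the hazard rises with disk age.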
Abstract
Reliability modelling of RAID storage systems and their various components, such as RAID controllers, enclosures, expanders, interconnects and disks, is important from a storage system designer's point of view. A model that can express all the failure characteristics of the whole RAID storage system can be used to evaluate design choices, perform cost-reliability trade-offs and conduct sensitivity analyses. However, including such detail quickly makes computational reliability models infeasible. We present a CTMC reliability model for RAID storage systems that scales to much larger systems than heretofore reported, and we try to model all the components as accurately as possible. We use several state-space reduction techniques at the user level, such as aggregating all in-series components and hierarchical decomposition, to reduce the size of our model. To automate computation of…
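Textbook reliability results can illustrate the two reductions named in the abstract. Components in series fail the chain at the first failure, and the minimum of independent exponential lifetimes is exponential with the summed rate. So n in-series components can collapse into one two-state component instead of 2^n tracked up/down combinations. Hierarchical decomposition is sketched here in one common form: solve a sub-model, then hand its mean time to failure to the next level as an equivalent constant rate. That hand-off is an approximation. The component names, rates and repair time are hypothetical, and this is not presented as the authors' exact procedure.

```python
from typing import Iterable


def series_rate(rates: Iterable[float]) -> float:
    """Collapse in-series exponential components into one equivalent rate."""
    return sum(rates)


def duplex_mttf(lam: float, mu: float) -> float:
    """MTTF of two redundant units, each failing at lam, repaired at mu.

    States: both up -> one up at 2*lam; one up -> both up at mu;
    one up -> none up (absorbing) at lam. Solving the two mean-time-to-
    absorption equations gives (3*lam + mu) / (2*lam**2).
    """
    return (3 * lam + mu) / (2 * lam * lam)


# Level 1 (hypothetical per-hour rates): one I/O path as a series chain.
PATH_RATES = {
    "raid_controller": 1 / 200_000,
    "expander": 1 / 500_000,
    "enclosure": 1 / 1_000_000,
    "interconnect": 1 / 2_000_000,
}
path_rate = series_rate(PATH_RATES.values())  # 16 up/down combos -> 1 component

# Level 2: two such paths as a redundant pair with a hypothetical 24 h repair.
pair_mttf = duplex_mttf(path_rate, 1 / 24)

# Hierarchical hand-off (approximation): the pair enters the next level
# as a single component with constant rate 1 / MTTF.
pair_rate = 1 / pair_mttf

print(f"path rate {path_rate:.2e}/h, pair MTTF {pair_mttf:.3e} h")
```

The series step is exact for independent exponential components. The hand-off step trades accuracy for size, because the pair's true time to failure is not exactly exponential.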
Taxonomy
Topics: Advanced Data Storage Technologies · Parallel Computing and Optimization Techniques · Distributed systems and fault tolerance
