Modeling Impact of Human Errors on the Data Unavailability and Data Loss of Storage Systems
Mostafa Kishani, Hossein Asadi

TL;DR
This paper models how human errors, specifically incorrect disk replacements, significantly impact data availability and loss in storage systems, revealing that traditional assumptions about RAID dependability may be overly optimistic.
Contribution
It introduces a novel modeling framework that incorporates human error effects into storage system reliability analysis, supported by Monte Carlo simulations and real datacenter data.
Findings
Ignoring human errors underestimates unavailability by up to 1000x
RAID1 may be less reliable than RAID5 when human errors are considered
Automatic fail-over policies can reduce human error impacts by two orders of magnitude
Abstract
Data storage systems and their availability play a crucial role in contemporary datacenters. Despite mechanisms such as automatic fail-over, human agents still play a role in datacenters, and their destructive errors are consequently inevitable. Due to the very large number of disk drives used in exascale datacenters and their high failure rates, the disk subsystem in storage systems has become a major source of Data Unavailability (DU) and Data Loss (DL) initiated by human errors. In this paper, we investigate the effect of Incorrect Disk Replacement Service (IDRS) on the availability and reliability of data storage systems. To this end, we analyze the consequences of IDRS in a disk array and conduct Monte Carlo simulations to evaluate DU and DL over the mission time. The proposed modeling framework can cope with a) different storage array configurations and b) Data Object Survivability…
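To make the Monte Carlo idea in the abstract concrete, here is a minimal sketch of how DU and DL might be estimated for a single-fault-tolerant disk array. It is not the authors' framework: the function names, the exponential failure model, and every parameter (fail_rate, repair_hours, hep for human error probability, fix_hours) are assumptions chosen for illustration, and the numbers are not taken from the paper.

```python
import random


def simulate_array(n_disks, fail_rate, repair_hours, hep, fix_hours,
                   mission_hours, rng):
    """Simulate one mission of an array that tolerates a single disk failure.

    Hypothetical model: returns (data_lost, downtime_hours). Requires n_disks >= 2.
    """
    t = 0.0
    downtime = 0.0
    while True:
        # Time until the first failure among n_disks working disks.
        t += rng.expovariate(n_disks * fail_rate)
        if t >= mission_hours:
            return False, downtime
        # Degraded mode: a second failure before the rebuild ends loses data.
        if rng.expovariate((n_disks - 1) * fail_rate) < repair_hours:
            return True, downtime
        service_hours = repair_hours
        # Assumed human error: the operator pulls a working disk, so the
        # array is unavailable until the mistake is undone.
        if rng.random() < hep:
            downtime += min(fix_hours, mission_hours - t)
            service_hours += fix_hours
        t += service_hours


def estimate(trials=20000, seed=0, **params):
    """Return (data loss probability, fraction of mission time unavailable)."""
    rng = random.Random(seed)
    losses = 0
    total_down = 0.0
    for _ in range(trials):
        lost, down = simulate_array(rng=rng, **params)
        losses += lost
        total_down += down
    return losses / trials, total_down / (trials * params["mission_hours"])


if __name__ == "__main__":
    # Illustrative values only: 4 disks, 10-year mission, 1% human error rate.
    dl, du = estimate(n_disks=4, fail_rate=1e-5, repair_hours=10.0,
                      hep=0.01, fix_hours=1.0, mission_hours=87600.0)
    print(f"P(data loss) ~ {dl:.4f}, unavailability ~ {du:.2e}")
```

Setting hep to zero in this sketch removes the human error contribution entirely, which gives a simple way to compare an estimate that ignores human errors against one that includes them.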
Taxonomy
Topics: Advanced Data Storage Technologies · Distributed systems and fault tolerance · Caching and Content Delivery
