On Fault Tolerance of Data Storage Systems: A Holistic Perspective
Mai Zheng, Duo Zhang, Ahmed Dajani

TL;DR
This paper provides a comprehensive overview of the architecture, components, and fault tolerance techniques of modern data storage systems, highlighting challenges and future directions for ensuring data integrity and system resilience.
Contribution
It offers a holistic perspective on fault tolerance in data storage systems, integrating insights across hardware (solid state drives, persistent memories), software (local file systems, distributed storage management), and bug detection techniques.
Findings
Overview of modern storage system architectures
Discussion of bug detection and fault tolerance techniques
Identification of open challenges and future research directions
Abstract
Data storage systems serve as the foundation of digital society. The enormous volume of data that people generate every day makes the fault tolerance of data storage systems increasingly important. Unfortunately, modern storage systems consist of complicated hardware and software layers that interact with each other and may contain latent bugs that elude extensive testing and lead to data corruption, system downtime, or even unrecoverable data loss in practice. In this chapter, we take a holistic view to introduce the typical architecture and major components of modern data storage systems (e.g., solid state drives, persistent memories, local file systems, and distributed storage management at scale). Next, we discuss a few representative bug detection and fault tolerance techniques across layers, with a focus on issues that affect system recovery and data integrity. Finally, we conclude…
