Live Forensics for Distributed Storage Systems
Saurabh Jha, Shengkun Cui, Tianyin Xu, Jeremy Enos, Mike Showerman, Mark Dalton, Zbigniew T. Kalbarczyk, William T. Kramer, Ravishankar K. Iyer

TL;DR
Kaleidoscope is a live forensics system designed for large-scale distributed storage systems, enabling rapid root cause analysis of performance issues with minimal overhead.
Contribution
The paper introduces Kaleidoscope, a novel live forensics system that uses differential observability and stochastic modeling to identify root causes in distributed storage environments.
Findings
Pinpoints root causes of 95.8% of issues
Performs forensics at 5-minute intervals
Imposes negligible monitoring overhead
Abstract
We present Kaleidoscope, an innovative system that supports live forensics for application performance problems caused by either individual component failures or resource contention issues in large-scale distributed storage systems. The design of Kaleidoscope is driven by our study of I/O failures observed in a peta-scale storage system anonymized as PetaStore. Kaleidoscope is built on three key features: 1) using temporal and spatial differential observability for end-to-end performance monitoring of I/O requests, 2) modeling the health of storage components as a stochastic process using domain-guided functions that account for path redundancy and uncertainty in measurements, and 3) observing differences in reliability and performance metrics between similar types of healthy and unhealthy components to attribute the most likely root causes. We deployed Kaleidoscope on PetaStore and…
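The third feature in the abstract, comparing metrics between healthy and unhealthy components of the same type, can be illustrated with a rough sketch. This is not Kaleidoscope's actual method, which the abstract does not detail. The function name rank_metrics, the metric names, and the scoring rule (deviation of the unhealthy mean from the healthy mean, scaled by the healthy spread) are all assumptions made for illustration.

```python
from statistics import mean, pstdev

def rank_metrics(healthy, unhealthy):
    """Rank metrics by how far unhealthy components deviate from healthy peers.

    Hypothetical helper: both arguments are lists of {metric_name: value}
    dicts describing components of the same type.
    """
    scores = {}
    for metric in healthy[0]:
        baseline = [c[metric] for c in healthy]
        mu, sigma = mean(baseline), pstdev(baseline)
        sick = mean(c[metric] for c in unhealthy)
        # Fall back to a unit scale when healthy peers show no spread.
        scores[metric] = abs(sick - mu) / (sigma if sigma > 0 else 1.0)
    # Largest deviation first: the most likely root-cause candidate.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Made-up readings for four storage servers (illustrative values only).
healthy = [
    {"disk_errors": 0, "load": 1.0},
    {"disk_errors": 0, "load": 1.2},
    {"disk_errors": 1, "load": 0.8},
]
unhealthy = [{"disk_errors": 40, "load": 1.1}]

ranking = rank_metrics(healthy, unhealthy)
print(ranking[0][0])
```

In this toy input the unhealthy server's load is close to that of its peers while its disk error count is far outside the healthy range, so disk_errors ranks first as the candidate cause.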
Taxonomy
Topics: Advanced Data Storage Technologies · Distributed systems and fault tolerance · Software System Performance and Reliability
