ESRO: Experience Assisted Service Reliability against Outages
Sarthak Chakraborty, Shubham Agarwal, Shaddy Garg, Abhimanyu Sethia, Udit Narayan Pandey, Videh Aggarwal, Shiv Saini

TL;DR
ESRO is a diagnostic system that combines structured alerts and semi-structured outage reports to improve root cause analysis and remediation recommendations for cloud service failures, with a reported 27% improvement in ROUGE scores over baselines.
Contribution
This work introduces a novel method to merge causal and knowledge graphs for outage diagnosis, leveraging both data sources systematically for the first time.
Findings
27% improvement in ROUGE scores over baselines
Effective in real cloud outage scenarios
Utilizes a unified graph for better root cause ranking
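To make the unified-graph idea concrete, here is a minimal toy sketch of merging a causal graph over alerts with a knowledge graph derived from past outage reports, then ranking candidate root causes. It is not ESRO's actual algorithm: the function names (merge_graphs, rank_root_causes), the report fields (root_cause, symptoms), the alert names, and the reachability-based score are all assumptions made for illustration.

```python
from collections import defaultdict


def merge_graphs(causal_edges, outage_reports):
    """Build one adjacency map holding both kinds of edges (hypothetical schema)."""
    graph = defaultdict(set)
    # Causal graph: alert -> alert edges.
    for cause, effect in causal_edges:
        graph[("alert", cause)].add(("alert", effect))
    # Knowledge graph: past root cause -> alerts observed during that outage.
    for report in outage_reports:
        root = ("root_cause", report["root_cause"])
        for alert in report["symptoms"]:
            graph[root].add(("alert", alert))
    return graph


def rank_root_causes(graph, active_alerts):
    """Score each past root cause by the fraction of active alerts reachable from it."""
    active = {("alert", name) for name in active_alerts}
    scores = {}
    for node in list(graph):
        if node[0] != "root_cause":
            continue
        # Depth-first traversal over the merged graph.
        seen, stack = set(), [node]
        while stack:
            current = stack.pop()
            for neighbor in graph.get(current, ()):
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        scores[node[1]] = len(seen & active) / len(active) if active else 0.0
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Toy data, invented for this example.
causal_edges = [
    ("db_latency_high", "api_error_rate_high"),
    ("api_error_rate_high", "checkout_failures"),
]
outage_reports = [
    {"root_cause": "connection pool exhausted", "symptoms": ["db_latency_high"]},
    {"root_cause": "bad deploy", "symptoms": ["api_error_rate_high"]},
]
active_alerts = ["db_latency_high", "api_error_rate_high", "checkout_failures"]

ranking = rank_root_causes(merge_graphs(causal_edges, outage_reports), active_alerts)
for cause, score in ranking:
    print(f"{cause}: {score:.2f}")
```

In this toy setup a root cause ranks higher when the causal edges let it explain more of the currently firing alerts, which is one simple way a merged graph can use both data sources at once.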
Abstract
Modern cloud services are prone to failures due to their complex architecture, making diagnosis a critical process. Site Reliability Engineers (SREs) spend hours leveraging multiple sources of data, including the alerts, error logs, and domain expertise through past experiences to locate the root cause(s). These experiences are documented as natural language text in outage reports for previous outages. However, utilizing the raw yet rich semi-structured information in the reports systematically is time-consuming. Structured information, on the other hand, such as alerts that are often used during fault diagnosis, is voluminous and requires expert knowledge to discern. Several strategies have been proposed to use each source of data separately for root cause analysis. In this work, we build a diagnostic service called ESRO that recommends root causes and remediation for failures by…
Taxonomy
Topics: Software System Performance and Reliability · Software Reliability and Analysis Research · Software Engineering Research
