AutoTSG: Learning and Synthesis for Incident Troubleshooting
Manish Shetty, Chetan Bansal, Sai Pramod Upadhyayula, Arjun Radhakrishna, Anurag Gupta

TL;DR
AutoTSG is a novel framework that uses machine learning and program synthesis to turn Troubleshooting Guides (TSGs) into executable workflows, addressing gaps in TSG quality and improving incident resolution efficiency.
Contribution
The paper introduces AutoTSG, an approach that combines machine learning and program synthesis to automate troubleshooting guides as executable workflows for incident management.
Findings
AutoTSG identifies TSG statements with 0.89 accuracy.
It parses TSG statements for execution with 0.94 precision and 0.91 recall.
An empirical study of 4K+ TSGs shows that they are widely used and significantly reduce mitigation effort.
Abstract
Incident management is a key aspect of operating large-scale cloud services. To aid faster and more efficient resolution of incidents, engineering teams document frequent troubleshooting steps in the form of Troubleshooting Guides (TSGs), to be used by on-call engineers (OCEs). However, TSGs are siloed, unstructured, and often incomplete, requiring developers to manually understand and execute the necessary steps. This results in a plethora of issues such as on-call fatigue, reduced productivity, and human errors. In this work, we conduct a large-scale empirical study of 4K+ TSGs mapped to 1000s of incidents and find that TSGs are widely used and help significantly reduce mitigation efforts. We then analyze feedback on TSGs provided by 400+ OCEs and propose a taxonomy of issues that highlights significant gaps in TSG quality. To alleviate these gaps, we investigate the automation of…
Taxonomy
Topics: Software System Performance and Reliability · Business Process Modeling and Analysis · Cloud Computing and Resource Management
