DrP: Meta's Efficient Investigations Platform at Scale
Shubham Somani, Vanish Talwar, Madhura Parikh, Eduardo Hernandez, Jimmy Wang, Shreya Shah, Chinmay Gandhi, Sanjay Sundarajan, Neeru Sharma, Srikanth Kamath, Nitin Gupta, Benjamin Renard, Ohad Yahalom, Chris Davis

TL;DR
DrP is a scalable, automated investigation platform at Meta that significantly reduces incident resolution time and on-call workload by automating complex troubleshooting workflows across diverse domains.
Contribution
The paper introduces DrP, a comprehensive framework with an SDK, backend, and integrations, enabling automated investigations at scale for large organizations.
Findings
Reduced average MTTR by 20% across teams
Deployed at Meta for 5 years, executing 50K analyses daily
Improved on-call productivity significantly
Abstract
Investigations are a significant step in the operational workflows for large-scale systems across multiple domains such as services, data, AI/ML, and mobile. Investigation processes followed by on-call engineers are often manual or rely on ad-hoc scripts. This leads to inefficient investigations, resulting in increased time to mitigate and isolate failures/SLO violations. It also contributes to on-call toil and poor productivity, leading to multiple hours/days spent in triaging/debugging incidents. In this paper, we present DrP, an end-to-end framework and system to automate investigations that reduces the mean time to resolve incidents (MTTR) and reduces on-call toil. DrP consists of an expressive and flexible SDK to author investigation playbooks in code (called analyzers), a scalable backend system to execute these automated playbooks, plug-ins to integrate playbooks into mainstream…
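To make the idea of "playbooks in code" concrete, here is a minimal Python sketch of what an analyzer-style abstraction might look like. It is not the DrP SDK: the names `Finding`, `Analyzer`, `error_rate_check`, `latency_check`, and `run_investigation`, as well as the metric keys and thresholds, are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Finding:
    """One conclusion produced by an analyzer (hypothetical type)."""
    summary: str
    confidence: float  # 0.0 to 1.0, illustrative scale


@dataclass
class Analyzer:
    """A playbook step expressed as code (hypothetical, not the DrP SDK)."""
    name: str
    check: Callable[[Dict[str, float]], List[Finding]]


def error_rate_check(metrics: Dict[str, float]) -> List[Finding]:
    # Assumed rule: flag an error rate above an illustrative 5% threshold.
    rate = metrics.get("error_rate", 0.0)
    if rate > 0.05:
        return [Finding(f"error rate elevated at {rate:.1%}", 0.8)]
    return []


def latency_check(metrics: Dict[str, float]) -> List[Finding]:
    # Assumed rule: flag p99 latency above an illustrative 500 ms threshold.
    p99 = metrics.get("p99_latency_ms", 0.0)
    if p99 > 500.0:
        return [Finding(f"p99 latency high at {p99:.0f} ms", 0.6)]
    return []


def run_investigation(
    analyzers: List[Analyzer], metrics: Dict[str, float]
) -> List[Finding]:
    """Run every analyzer and return findings, most confident first."""
    findings: List[Finding] = []
    for analyzer in analyzers:
        findings.extend(analyzer.check(metrics))
    return sorted(findings, key=lambda f: f.confidence, reverse=True)
```

Under these assumptions, an on-call engineer would register checks once and run them against incident metrics, for example `run_investigation([Analyzer("errors", error_rate_check), Analyzer("latency", latency_check)], {"error_rate": 0.12, "p99_latency_ms": 300.0})`, instead of repeating the same manual steps for each incident.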
Taxonomy
Topics: Software System Performance and Reliability · Software Testing and Debugging Techniques · Scientific Computing and Data Management
