DeepTriage: Automated Transfer Assistance for Incidents in Cloud Services
Phuong Pham, Vivek Jain, Lukas Dauterman, Justin Ormont, Navendu Jain

TL;DR
DeepTriage is an ensemble machine learning system that automates incident assignment in cloud services, routing each incident to the responsible team to cut reroutings and time to mitigate in a large-scale environment.
Contribution
This paper introduces DeepTriage, a novel ensemble approach combining multiple ML techniques for automated incident triage in cloud services, addressing key scalability and trust challenges.
Findings
Achieves 82.9% F1 score on real incidents
F1 score ranges from 76.3% to 91.3% for high-impact incidents
Deployed in Azure since 2017, used by thousands of teams daily
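For readers unfamiliar with the metric, the F1 figures above combine precision and recall into a single number. The sketch below computes per-team and macro-averaged F1 for a toy triage task. The team names and labels are hypothetical, and this summary does not say which averaging scheme (micro, macro, or weighted) the paper reports, so macro averaging here is only an assumption for illustration.

```python
from collections import Counter


def f1_per_team(true_teams, predicted_teams):
    """Return {team: F1} for every team seen in the true or predicted labels."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(true_teams, predicted_teams):
        if truth == pred:
            tp[truth] += 1   # incident routed to the correct team
        else:
            fp[pred] += 1    # the predicted team received a wrong incident
            fn[truth] += 1   # the true team missed its incident
    scores = {}
    for team in set(true_teams) | set(predicted_teams):
        # F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall
        denom = 2 * tp[team] + fp[team] + fn[team]
        scores[team] = 2 * tp[team] / denom if denom else 0.0
    return scores


def macro_f1(true_teams, predicted_teams):
    """Unweighted mean of per-team F1 (an assumed averaging choice)."""
    scores = f1_per_team(true_teams, predicted_teams)
    return sum(scores.values()) / len(scores)


# Hypothetical toy data: six incidents across three made-up teams.
true_teams = ["storage", "storage", "network", "compute", "network", "storage"]
predicted_teams = ["storage", "network", "network", "compute", "network", "storage"]

scores = f1_per_team(true_teams, predicted_teams)
overall = macro_f1(true_teams, predicted_teams)
```

With a highly imbalanced team distribution, the choice of averaging matters: macro averaging weights rare teams equally with busy ones, while micro averaging is dominated by the teams that receive the most incidents.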
Abstract
As cloud services grow and generate high revenues, the cost of downtime in these services is becoming increasingly high. To reduce loss and service downtime, a critical first step is to execute incident triage, the process of assigning a service incident to the correct responsible team, in a timely manner. An incorrect assignment risks additional incident reroutings and increases the incident's time to mitigate by 10x. However, automated incident triage in large cloud services faces many challenges: (1) a highly imbalanced incident distribution from a large number of teams, (2) wide variety in formats of input data or data sources, (3) scaling to meet production-grade requirements, and (4) gaining engineers' trust in using machine learning recommendations. To address these challenges, we introduce DeepTriage, an intelligent incident transfer service combining multiple machine…
