DeCaf: Diagnosing and Triaging Performance Issues in Large-Scale Cloud Services
Chetan Bansal, Sundararajan Renganathan, Ashima Asudani, Olivier Midy, Mathru Janakiraman

TL;DR
DeCaf is an automated system that uses machine learning and pattern mining to diagnose and triage performance issues in large-scale cloud services efficiently, reducing manual effort and customer impact.
Contribution
The paper introduces DeCaf, a novel system that automates the diagnosis and triaging of KPI issues in cloud services using service logs and machine learning, and that has been deployed at Microsoft.
Findings
DeCaf diagnosed 10 known issues and 31 unknown issues in large-scale cloud services.
It effectively triaged issues by leveraging historical data.
The system scales to large log volumes and supports various KPI types.
Abstract
Large-scale cloud services use Key Performance Indicators (KPIs) for tracking and monitoring performance. They usually have Service Level Objectives (SLOs) baked into the customer agreements which are tied to these KPIs. Dependency failures, code bugs, infrastructure failures, and other problems can cause performance regressions. It is critical to minimize the time and manual effort in diagnosing and triaging such issues to reduce customer impact. The large volume of logs and the mixed types of attributes (categorical, continuous) in the logs make diagnosis of regressions non-trivial. In this paper, we present the design, implementation and experience from building and deploying DeCaf, a system for automated diagnosis and triaging of KPI issues using service logs. It uses machine learning along with pattern mining to help service owners automatically root cause and triage performance issues.…
