Automated Cause Analysis of Latency Outliers Using System-Level Dependency Graphs
Sneh Patel, Brendan Park, Naser Ezzati-Jivan, Quentin Fournier

TL;DR
This paper presents an automated approach for detecting latency outliers and their root causes in system performance using dependency graphs and machine learning, achieving over 97% accuracy.
Contribution
It introduces a novel method combining system-level dependency graphs with machine learning to automate latency outlier detection and root cause analysis.
Findings
Achieves over 97% accuracy in outlier detection
Effective in production server environments
Automates root cause identification process
Abstract
Detecting performance issues and identifying their root causes at runtime is a challenging task. Typically, developers use methods such as logging and tracing to identify bottlenecks. These solutions are, however, not ideal, as they are time-consuming and require manual effort. In this paper, we propose a method to automate the task of detecting latency outliers using system-level traces and then comparing them to identify the root cause(s). Our method makes use of dependency graphs to show internal interactions between threads and system resources. With these graphs, one can pinpoint where performance issues occur. However, a single trace can be composed of a large number of requests, each generating one graph. To automate the task of identifying outliers within the dataset, we use machine learning density-based models and statistical calculations such as z-score. Our evaluation…
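To make the idea concrete, here is a minimal Python sketch of the statistical half of the pipeline. It is an illustration under stated assumptions, not the authors' implementation: each request's dependency graph is reduced to a hypothetical dictionary mapping a resource name to the time the request spent waiting on it, outliers are flagged with a plain z-score on total latency (the density-based models mentioned in the abstract are not shown), and the "root cause" is approximated as the resource with the largest excess wait compared to normal requests. The function names, the threshold of 3.0, and the sample data are all invented for the example.

```python
from statistics import mean, pstdev


def zscore_outliers(latencies_ms, threshold=3.0):
    """Return indices of requests whose latency z-score exceeds the threshold."""
    mu = mean(latencies_ms)
    sigma = pstdev(latencies_ms)
    if sigma == 0:
        return []  # all latencies identical, nothing stands out
    return [i for i, x in enumerate(latencies_ms)
            if abs(x - mu) / sigma > threshold]


def rank_excess_wait(outlier_graph, normal_graphs):
    """Rank resources by the outlier's extra wait time over the normal average.

    Each graph is a hypothetical simplification: {resource_name: wait_ms}.
    """
    resources = set(outlier_graph)
    for graph in normal_graphs:
        resources |= set(graph)
    baseline = {r: mean(g.get(r, 0.0) for g in normal_graphs)
                for r in resources}
    excess = {r: outlier_graph.get(r, 0.0) - baseline[r] for r in resources}
    return sorted(excess.items(), key=lambda kv: kv[1], reverse=True)


# Invented sample data: 19 ordinary requests and one slowed down by disk waits.
request_graphs = [{"cpu": 5.0, "disk": 3.0, "net": 2.0} for _ in range(20)]
request_graphs[7] = {"cpu": 5.0, "disk": 90.0, "net": 2.0}

# Total latency of a request is taken as the sum of its per-resource waits.
latencies = [sum(g.values()) for g in request_graphs]
outliers = zscore_outliers(latencies)

normal = [g for i, g in enumerate(request_graphs) if i not in outliers]
ranked = rank_excess_wait(request_graphs[outliers[0]], normal)
print(outliers, ranked[0])
```

With this sample data, request 7 is the only one flagged, and the disk is ranked first because the outlier waited 87 ms longer on it than a normal request did. A z-score needs a reasonably large sample to be meaningful: with n requests, no single z-score (computed with the population standard deviation, as here) can exceed (n - 1) / sqrt(n), so a threshold of 3.0 can never fire on very small traces.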
