LogRCA: Log-based Root Cause Analysis for Distributed Services
Thorsten Wittkopp, Philipp Wiesner, Odej Kao

TL;DR
LogRCA is a semi-supervised method that identifies root causes in large-scale log data. It outperforms existing approaches, especially for rare failures, which helps speed up troubleshooting in complex distributed systems.
Contribution
We introduce LogRCA, a semi-supervised learning approach for root cause analysis that handles noisy data and rare errors, with demonstrated superior performance on large-scale production logs.
Findings
LogRCA outperforms deep learning and statistical baselines in precision and recall.
Data balancing significantly improves detection of rare failures.
The method remains effective on large-scale logs with 44.3 million entries and 80 labeled failures.
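To make the data balancing finding more concrete, here is a minimal sketch of one common balancing strategy: random oversampling of under-represented failure types. This is an illustrative assumption, not the procedure described in the paper. The function name `oversample_rare_failures`, the `(failure_type, log_line)` sample shape, and the example failure types are all hypothetical.

```python
import random
from collections import defaultdict


def oversample_rare_failures(samples, seed=0):
    """Replicate samples of rare failure types until every type has as many
    samples as the most frequent one.

    `samples` is a list of (failure_type, log_line) pairs. Both this shape
    and the oversampling strategy are assumptions made for illustration.
    """
    rng = random.Random(seed)

    # Group log lines by the failure they belong to.
    by_type = defaultdict(list)
    for failure_type, log_line in samples:
        by_type[failure_type].append(log_line)

    # Every failure type is padded up to the size of the largest group.
    target = max(len(lines) for lines in by_type.values())

    balanced = []
    for failure_type, lines in by_type.items():
        # Keep all originals, then draw extra copies with replacement.
        extra = [rng.choice(lines) for _ in range(target - len(lines))]
        balanced.extend((failure_type, line) for line in lines + extra)
    return balanced


# Hypothetical toy data: one frequent failure type and one rare one.
samples = [
    ("disk_full", "write failed: no space left on device"),
    ("disk_full", "cannot allocate block"),
    ("disk_full", "volume usage at 100%"),
    ("cert_expired", "TLS handshake failed: certificate expired"),
]
balanced = oversample_rare_failures(samples)
```

After balancing, both failure types contribute the same number of training samples, so a model trained on `balanced` is less likely to ignore the rare failure. Oversampling is only one option; undersampling frequent failures or reweighting the loss are alternatives with different trade-offs.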
Abstract
To assist IT service developers and operators in managing their increasingly complex service landscapes, there is a growing effort to leverage artificial intelligence in operations. To speed up troubleshooting, log anomaly detection has received much attention in particular, dealing with the identification of log events that indicate the reasons for a system failure. However, faults often propagate extensively within systems, which can result in a large number of anomalies being detected by existing approaches. In this case, it can remain very challenging for users to quickly identify the actual root cause of a failure. We propose LogRCA, a novel method for identifying a minimal set of log lines that together describe a root cause. LogRCA uses a semi-supervised learning approach to deal with rare and unknown errors and is designed to handle noisy data. We evaluated our approach on a…
Taxonomy
Topics: Service-Oriented Architecture and Web Services
