Failure Identification from Unstable Log Data using Deep Learning
Jasmin Bogatinovski, Sasho Nedelkoski, Li Wu, Jorge Cardoso, Odej Kao

TL;DR
This paper introduces CLog, a deep learning approach that improves failure identification in cloud systems by representing log data as subprocess sequences, effectively handling data instability and incomplete failure coverage.
Contribution
CLog's novel subprocess extraction method reduces log data instability and enhances failure detection and classification accuracy in cloud system logs.
Findings
CLog outperforms baselines with 9-24% higher F1 scores in failure detection.
CLog achieves a 7% improvement in failure type identification.
Subprocess representations mitigate the impact of unstable log data.
Abstract
The reliability of cloud platforms is of significant relevance because society increasingly relies on complex software systems running on the cloud. To improve it, cloud providers are automating various maintenance tasks, with failure identification frequently being considered. The precondition for automation is the availability of observability tools, with system logs commonly being used. The focus of this paper is log-based failure identification. This problem is challenging because of the instability of the log data and the incompleteness of the explicit logging failure coverage within the code. To address the two challenges, we present CLog as a method for failure identification. The key idea presented herein is based on our observation that by representing the log data as sequences of subprocesses instead of sequences of log events, the effect of the unstable log data is reduced.…
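The abstract's key idea, that sequences of subprocesses are less sensitive to unstable logs than sequences of raw log events, can be illustrated with a toy sketch. This is not CLog's subprocess extraction method, which this summary does not describe. The event names, the hand-written `EVENT_TO_SUBPROCESS` mapping, and the `to_subprocess_sequence` helper are all hypothetical and exist only to show why a coarser representation can absorb small changes in the event stream.

```python
from itertools import groupby

# Hypothetical, hand-written mapping from log events to subprocess labels.
# Assumption for illustration only: CLog derives its subprocesses from data,
# and that procedure is not reproduced here.
EVENT_TO_SUBPROCESS = {
    "auth_request": "authenticate",
    "auth_token_ok": "authenticate",
    "auth_token_cached": "authenticate",
    "alloc_start": "allocate",
    "alloc_done": "allocate",
    "vm_boot": "boot",
}


def to_subprocess_sequence(events):
    """Map each event to its subprocess label and merge consecutive repeats."""
    labels = [EVENT_TO_SUBPROCESS.get(event, "unknown") for event in events]
    # groupby collapses runs of identical adjacent labels into one entry.
    return [label for label, _ in groupby(labels)]


# Two event sequences for the same workload; the second contains an extra
# log event, standing in for the kind of instability the abstract mentions.
stable = ["auth_request", "auth_token_ok", "alloc_start", "alloc_done", "vm_boot"]
unstable = [
    "auth_request",
    "auth_token_cached",
    "auth_token_ok",
    "alloc_start",
    "alloc_done",
    "vm_boot",
]

stable_subprocesses = to_subprocess_sequence(stable)
unstable_subprocesses = to_subprocess_sequence(unstable)
```

In this toy case the two event sequences differ, yet both reduce to the same subprocess sequence, so a downstream model would see identical input despite the added event.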
Taxonomy
Topics: Software System Performance and Reliability · Cloud Computing and Resource Management · Anomaly Detection Techniques and Applications
