Improving Problem Identification via Automated Log Clustering using Dimensionality Reduction
Carl Martin Rosenberg, Leon Moonen

TL;DR
This paper investigates how dimensionality reduction techniques such as NMF affect automated log clustering for problem identification in continuous deployment logs, and reports improved accuracy and robustness over the earlier approach it replicates.
Contribution
It extends prior log clustering approaches to continuous deployment logs, evaluating the impact of various dimensionality reduction and cluster merging techniques on clustering quality.
Findings
NMF significantly improves clustering accuracy and robustness.
Complete Linkage yields the best clustering performance.
Dimensionality reduction increases pipeline robustness and reduces parameter sensitivity.
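To make the role of the merge criterion concrete, here is a minimal sketch of agglomerative clustering with a pluggable linkage rule. It is not the paper's implementation. The function names (`linkage_distance`, `agglomerate`), the distance threshold, and the toy one-dimensional points are all assumptions made for illustration. Complete Linkage is the criterion named in the findings above; single and average linkage appear only as common textbook contrasts.

```python
from itertools import combinations
from math import dist


def linkage_distance(a, b, points, criterion):
    """Distance between two clusters, each given as a list of point indices."""
    pairwise = [dist(points[i], points[j]) for i in a for j in b]
    if criterion == "complete":   # farthest pair decides
        return max(pairwise)
    if criterion == "single":     # closest pair decides
        return min(pairwise)
    if criterion == "average":    # mean over all cross-cluster pairs
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unknown criterion: {criterion}")


def agglomerate(points, threshold, criterion="complete"):
    """Merge the two closest clusters until the closest pair exceeds threshold."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest linkage distance.
        (x, y), d = min(
            (
                ((x, y), linkage_distance(clusters[x], clusters[y], points, criterion))
                for x, y in combinations(range(len(clusters)), 2)
            ),
            key=lambda item: item[1],
        )
        if d > threshold:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]  # y > x, so index x is unaffected
    return [sorted(c) for c in clusters]


# Hypothetical 1-D feature vectors standing in for vectorized logs.
points = [(0.0,), (1.0,), (2.5,), (4.5,), (20.0,)]
for criterion in ("single", "average", "complete"):
    print(criterion, agglomerate(points, threshold=2.1, criterion=criterion))
```

On data like this, single linkage tends to chain neighbouring points into one long cluster, while complete linkage refuses a merge as soon as any pair in the merged cluster would be too far apart.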
Abstract
Goal: We consider the problem of automatically grouping logs of runs that failed for the same underlying reasons, so that they can be treated more effectively, and investigate the following questions: (1) Does an approach developed to identify problems in system logs generalize to identifying problems in continuous deployment logs? (2) How does dimensionality reduction affect the quality of automated log clustering? (3) How does the criterion used for merging clusters in the clustering algorithm affect clustering quality? Method: We replicate and extend earlier work on clustering system log files to assess its generalization to continuous deployment logs. We consider the optional inclusion of one of these dimensionality reduction techniques: Principal Component Analysis (PCA), Latent Semantic Indexing (LSI), and Non-negative Matrix Factorization (NMF). Moreover, we consider three…
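As an illustration of what reducing dimensionality with NMF means in practice, the sketch below factorizes a small non-negative matrix V into W and H using the standard multiplicative update rules. This is a generic textbook variant, not necessarily the one used in the paper. The toy matrix (rows imagined as logs, columns as term counts), the helper names, the rank, and the iteration count are all assumptions.

```python
import random


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def frobenius_error(v, w, h):
    """Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum(
        (v[i][j] - wh[i][j]) ** 2 for i in range(len(v)) for j in range(len(v[0]))
    ) ** 0.5


def nmf(v, rank, iterations=200, seed=0, eps=1e-9):
    """Approximate V (n x m) as W (n x rank) times H (rank x m), all non-negative."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    # Strictly positive random start so the updates never divide by zero.
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(m)]
             for k in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(rank)]
             for i in range(n)]
    return w, h


# Hypothetical term-count matrix: two groups of logs using disjoint vocabularies.
v = [[3, 2, 0, 0], [2, 3, 0, 0], [0, 0, 4, 1], [0, 0, 1, 4]]
w, h = nmf(v, rank=2)
print("reduced representation of each log:", w)
print("reconstruction error:", frobenius_error(v, w, h))
```

Each row of W is a rank-dimensional representation of one log, which could then be passed to a clustering step in place of the original term vector.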
