TL;DR
This paper explores automating log record parsing with machine translation models trained on synthetic data. On real-world Apache logs, the median relative edit distance between an actual log record and the model's prediction is at most 28%.
Contribution
It introduces an approach that trains machine translation models on synthetic log records to automate the parsing of heterogeneous log formats, replacing manually built parsers.
Findings
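Recurrent-neural-network-based MT models trained on synthetic records can learn the Apache log format and parse individual real-world log records.
The median relative edit distance between an actual real-world log record and the MT prediction is ≤ 28%.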
MT-based parsing shows promising accuracy.
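For readers unfamiliar with the metric, the sketch below shows one common way to compute a median relative edit distance. It is an illustration only: the summary does not state how the paper normalizes the distance, so dividing the Levenshtein distance by the reference length is an assumption, and the function names are hypothetical.

```python
from statistics import median


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def relative_edit_distance(reference: str, prediction: str) -> float:
    # Assumption: normalise by the reference length. The paper may use a
    # different denominator (for example the longer of the two strings).
    if not reference:
        return 0.0 if not prediction else 1.0
    return edit_distance(reference, prediction) / len(reference)


def median_relative_edit_distance(pairs) -> float:
    """Median over (reference, prediction) string pairs."""
    return median(relative_edit_distance(r, p) for r, p in pairs)
```

Under this definition, a value of 0.28 would mean that the number of character edits needed to turn the prediction into the reference equals 28% of the reference length.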
Abstract
Software log analysis helps to maintain the health of software solutions and ensure compliance and security. Existing software systems consist of heterogeneous components emitting logs in various formats. A typical solution is to unify the logs using manually built parsers, which is laborious. Instead, we explore the possibility of automating the parsing task by employing machine translation (MT). We create a tool that generates synthetic Apache log records, which we use to train recurrent-neural-network-based MT models. Evaluation of the models on real-world logs shows that they can learn the Apache log format and parse individual log records. The median relative edit distance between an actual real-world log record and the MT prediction is less than or equal to 28%. Thus, we show that log parsing using an MT approach is promising.
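To make the synthetic-data idea concrete, here is a minimal sketch of a generator that emits a raw Apache-style record together with a field-tagged target string, the kind of source/target pair an MT model could be trained on. It is not the authors' tool: the record follows the Apache Common Log Format (`%h %l %u %t "%r" %>s %b`), while the tagged target representation, the value pools, and the name `synthetic_record` are assumptions made for illustration.

```python
import random
from datetime import datetime, timedelta, timezone

# Hypothetical value pools; a real generator would sample far more widely.
METHODS = ["GET", "POST", "HEAD"]
PATHS = ["/", "/index.html", "/api/items", "/images/logo.png"]
STATUSES = [200, 301, 404, 500]


def synthetic_record(rng: random.Random) -> tuple[str, str]:
    """Return (source, target): a raw log line and its tagged fields."""
    host = ".".join(str(rng.randint(1, 254)) for _ in range(4))
    when = datetime(2020, 1, 1, tzinfo=timezone.utc) + timedelta(
        seconds=rng.randint(0, 365 * 24 * 3600)
    )
    fields = {
        "host": host,
        "time": when.strftime("%d/%b/%Y:%H:%M:%S %z"),
        "request": f"{rng.choice(METHODS)} {rng.choice(PATHS)} HTTP/1.1",
        "status": str(rng.choice(STATUSES)),
        "size": str(rng.randint(0, 50000)),
    }
    # Source side: Common Log Format with ident and user left as "-".
    source = '{host} - - [{time}] "{request}" {status} {size}'.format(**fields)
    # Target side (assumed representation): each value preceded by its tag.
    target = " ".join(f"<{name}> {value}" for name, value in fields.items())
    return source, target


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the corpus is reproducible
    for _ in range(3):
        src, tgt = synthetic_record(rng)
        print(src)
        print(tgt)
```

Because the generator knows each field before it formats the line, every synthetic record comes with a correct target for free, which is what makes training without hand-labelled logs possible.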
Taxonomy
Methods: Gated Recurrent Unit · Long Short-Term Memory
