On Automatic Parsing of Log Records

Jared Rand; Andriy Miranskyy

arXiv:2102.06320 · cs.SE · August 4, 2021

TL;DR

This paper explores automating log record parsing with machine translation models trained on synthetic data. Evaluated on real-world Apache logs, the models reach a median relative edit distance of at most 28% between the actual log record and the model's prediction.

Contribution

The paper proposes training machine translation models on synthetic log records, produced by a purpose-built generator, to automate the parsing of heterogeneous log formats.
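To make the idea of synthetic training data concrete, here is a minimal sketch of how (raw line, labelled fields) pairs could be generated for a translation-style model. It is not the authors' generator: the function name `make_pair`, the field names, the value pools, and the choice of Apache's Common Log Format as the only layout are all assumptions made for illustration.

```python
import random
from datetime import datetime, timedelta, timezone

# Hypothetical value pools; a real generator would draw from far richer data.
METHODS = ["GET", "POST", "PUT", "DELETE"]
PATHS = ["/", "/index.html", "/api/items", "/images/logo.png"]
STATUSES = [200, 301, 404, 500]
START = datetime(2021, 1, 1, tzinfo=timezone.utc)


def make_pair(rng: random.Random) -> tuple[str, dict[str, str]]:
    """Return (raw log line, labelled fields) for one synthetic record.

    The raw line follows Apache's Common Log Format:
    host ident user [time] "request" status size
    """
    moment = START + timedelta(seconds=rng.randint(0, 30 * 86_400))
    fields = {
        "host": ".".join(str(rng.randint(1, 254)) for _ in range(4)),
        "ident": "-",
        "user": "-",
        "time": moment.strftime("%d/%b/%Y:%H:%M:%S %z"),
        "request": f"{rng.choice(METHODS)} {rng.choice(PATHS)} HTTP/1.1",
        "status": str(rng.choice(STATUSES)),
        "size": str(rng.randint(0, 50_000)),
    }
    raw = '{host} {ident} {user} [{time}] "{request}" {status} {size}'.format(**fields)
    return raw, fields


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the sample is reproducible
    for _ in range(3):
        line, labels = make_pair(rng)
        print(line)
        print(labels)
```

Because the generator knows the true value of every field, each raw line comes with its own ground-truth parse, which is what makes supervised training possible without hand-labelling real logs.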

Findings

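01. The models learn the Apache log format and parse individual log records.

02. The median relative edit distance between an actual real-world log record and the MT prediction is at most 28%.

03. The authors conclude that log parsing with an MT approach is promising.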

Abstract

Software log analysis helps to maintain the health of software solutions and ensure compliance and security. Existing software systems consist of heterogeneous components emitting logs in various formats. A typical solution is to unify the logs using manually built parsers, which is laborious. Instead, we explore the possibility of automating the parsing task by employing machine translation (MT). We create a tool that generates synthetic Apache log records, which we use to train recurrent-neural-network-based MT models. Evaluation of the models on real-world logs shows that they can learn the Apache log format and parse individual log records. The median relative edit distance between an actual real-world log record and the MT prediction is less than or equal to 28%. Thus, we show that log parsing using an MT approach is promising.
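The headline metric can be illustrated with a short sketch. The abstract does not spell out how the edit distance is normalized, so the code below assumes Levenshtein distance divided by the length of the longer string; the paper's exact definition may differ. The function names are hypothetical.

```python
from statistics import median


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character inserts, deletes, substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def relative_edit_distance(reference: str, prediction: str) -> float:
    """Edit distance scaled to [0, 1] by the longer string (an assumed normalization)."""
    longest = max(len(reference), len(prediction))
    if longest == 0:
        return 0.0
    return levenshtein(reference, prediction) / longest


def median_relative_distance(pairs: list[tuple[str, str]]) -> float:
    """Median of the relative distances over (reference, prediction) pairs."""
    return median(relative_edit_distance(ref, pred) for ref, pred in pairs)
```

Under this normalization, a value of 0.28 means that the number of character edits needed to turn the prediction into the reference is 28% of the longer string's length.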

Code & Models

Repositories

WulffHunter/log_generator (Official)

Taxonomy

Methods: Gated Recurrent Unit · Long Short-Term Memory