Semantic-Aware Parsing for Security Logs
Julien Piet, Vivian Fang, Rishi Khare, Scott Coull, Vern Paxson, Raluca Ada Popa, David Wagner

TL;DR
Matryoshka uses large language models to automatically generate semantic log parsers from unlabeled security logs, enabling scalable, accurate, and automated security log analysis without manual parser creation.
Contribution
The paper introduces Matryoshka, a system that automatically infers and generates security log parsers using LLMs without labeled data, improving scalability and reducing manual effort.
Findings
Outperforms prior syntax parsing methods on benchmark datasets.
Matches human-generated parsers in security query retrieval tasks.
Reduces manual effort in security log data normalization.
Abstract
Security logs are foundational to threat detection and post-incident investigation, yet analysts often struggle to fully leverage them due to their heterogeneity and unstructured nature. The standard practice of manually writing parsers to normalize the data in security event management systems is time-consuming and costly due to the long tail of log formats. Meanwhile, querying raw logs without explicit parsing using large language models (LLMs) is impractical at scale. In this paper, we introduce Matryoshka, an end-to-end system leveraging LLMs to automatically generate semantically-aware structured log parsers without labeled examples or human intervention. Matryoshka achieves this by directly inferring log syntax, variable naming, and normalization to common security-specific schemas (e.g., OCSF [1]) from unlabeled log line samples, then generating deterministic parsers and mapping…
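To make the idea of a generated "deterministic parser" concrete, here is a minimal Python sketch of the kind of artifact such a pipeline might emit: a fixed template for the log syntax, semantically named variables, and a mapping onto schema-style attribute paths. Everything in it is an assumption for illustration and is not taken from the paper: the sample line (written in the style of an SSH authentication failure), the names `SSHD_FAILED`, `FIELD_MAP`, and `parse_line`, and the OCSF-style paths, which have not been checked against the paper's actual mappings.

```python
import re

# Hypothetical template for one log format. In a system like the one described,
# this would be inferred from unlabeled samples; here it is hand-written.
SSHD_FAILED = re.compile(
    r"^Failed password for (?P<user_name>\S+) "
    r"from (?P<source_ip>\d{1,3}(?:\.\d{1,3}){3}) "
    r"port (?P<source_port>\d+) ssh2$"
)

# Assumed mapping from inferred variable names to OCSF-style attribute paths.
# The paths are illustrative and not verified against the paper's output.
FIELD_MAP = {
    "user_name": "user.name",
    "source_ip": "src_endpoint.ip",
    "source_port": "src_endpoint.port",
}


def parse_line(line):
    """Return a normalized record for a matching line, or None if no match."""
    match = SSHD_FAILED.match(line)
    if match is None:
        return None
    record = {}
    for variable, path in FIELD_MAP.items():
        value = match.group(variable)
        if variable == "source_port":
            value = int(value)  # normalize the port to an integer
        record[path] = value
    return record


# Illustrative input line; the address is from a documentation-only range.
example = parse_line("Failed password for root from 203.0.113.7 port 52144 ssh2")
```

Because the template and mapping are fixed once generated, parsing each line needs no model call, which is the practical difference from querying raw logs with an LLM directly.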
Taxonomy
Topics: Software System Performance and Reliability · Network Security and Intrusion Detection · Data Visualization and Analytics
