SchemaCoder: Automatic Log Schema Extraction Coder with Residual Q-Tree Boosting
Lily Jiaxin Wan, Chia-Tung Ho, Rongjian Liang, Cunxi Yu, Deming Chen, Haoxing Ren

TL;DR
SchemaCoder is a fully automated framework that combines a Residual Q-Tree Boosting mechanism with LLMs to extract log schemas across diverse formats without human intervention, improving on state-of-the-art methods by 21.3% on LogHub-2.0.
Contribution
The paper introduces what the authors describe as the first fully automated log schema extraction framework. Its Residual Q-Tree Boosting approach removes the need for human-designed regular expressions.
Findings
Achieves a 21.3% improvement over state-of-the-art methods on LogHub-2.0.
Effectively extracts schemas from diverse log formats without human customization.
Demonstrates robustness and scalability across large log datasets.
Abstract
Log schema extraction is the process of deriving human-readable templates from massive volumes of log data, which is essential yet notoriously labor-intensive. Recent studies have attempted to streamline this task by leveraging Large Language Models (LLMs) for automated schema extraction. However, existing methods invariably rely on predefined regular expressions, necessitating human domain expertise and severely limiting productivity gains. To fundamentally address this limitation, we introduce SchemaCoder, the first fully automated schema extraction framework applicable to a wide range of log file formats without requiring human customization within the flow. At its core, SchemaCoder features a novel Residual Question-Tree (Q-Tree) Boosting mechanism that iteratively refines schema extraction through targeted, adaptive queries driven by LLMs. Particularly, our method partitions logs…
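To make the idea of a log template and of refining on a residual concrete, here is a minimal sketch. It is not the authors' algorithm: the abstract is truncated before it describes the method, and no Q-Tree is built here. The `<*>` placeholder syntax, the function names, and the digit-masking heuristic in `propose_template` (a hypothetical stand-in for an LLM query) are all assumptions made for illustration. The sketch shows only the general pattern of proposing a template, removing the lines it explains, and repeating on whatever remains unmatched.

```python
import re


def template_to_regex(template: str) -> re.Pattern:
    # Escape the literal text and turn each <*> placeholder into a
    # wildcard that matches one run of non-space characters.
    parts = [re.escape(part) for part in template.split("<*>")]
    return re.compile("^" + r"(\S+)".join(parts) + "$")


def propose_template(line: str) -> str:
    # Hypothetical stand-in for an LLM query (assumption, not from the paper):
    # treat any token that contains a digit as a variable field.
    return " ".join(
        "<*>" if any(ch.isdigit() for ch in tok) else tok
        for tok in line.split(" ")
    )


def extract_templates(logs, max_rounds=10):
    # Each round proposes one template from the first unexplained line,
    # then keeps only the lines that the template fails to match.
    templates = []
    residual = list(logs)
    for _ in range(max_rounds):
        if not residual:
            break
        template = propose_template(residual[0])
        pattern = template_to_regex(template)
        templates.append(template)
        residual = [line for line in residual if not pattern.match(line)]
    return templates, residual


logs = [
    "Connected to 10.0.0.1 port 22",
    "Connected to 10.0.0.2 port 80",
    "Disk usage at 91%",
    "Session closed",
]
templates, residual = extract_templates(logs)
for t in templates:
    print(t)
```

On these four sample lines the loop runs three rounds and leaves no residual. The `max_rounds` bound keeps the loop finite even if a proposed template fails to match the line it was derived from.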
