SchemaCoder: Automatic Log Schema Extraction Coder with Residual Q-Tree Boosting

Lily Jiaxin Wan; Chia-Tung Ho; Rongjian Liang; Cunxi Yu; Deming Chen; Haoxing Ren

arXiv:2508.18554·cs.AI·August 27, 2025

TL;DR

SchemaCoder is a fully automated framework that combines LLMs with a Residual Q-Tree Boosting mechanism to extract log schemas across diverse formats without human intervention, improving on state-of-the-art methods by 21.3% on LogHub-2.0.

Contribution

The paper introduces what the authors describe as the first fully automated log schema extraction framework. Its Residual Q-Tree Boosting approach removes the need for human-designed regular expressions.

Findings

01. Achieves a 21.3% improvement over state-of-the-art methods on the LogHub-2.0 benchmark.

02. Extracts schemas from diverse log formats without human customization.

03. Demonstrates robustness and scalability across large log datasets.

Abstract

Log schema extraction is the process of deriving human-readable templates from massive volumes of log data, which is essential yet notoriously labor-intensive. Recent studies have attempted to streamline this task by leveraging Large Language Models (LLMs) for automated schema extraction. However, existing methods invariably rely on predefined regular expressions, necessitating human domain expertise and severely limiting productivity gains. To fundamentally address this limitation, we introduce SchemaCoder, the first fully automated schema extraction framework applicable to a wide range of log file formats without requiring human customization within the flow. At its core, SchemaCoder features a novel Residual Question-Tree (Q-Tree) Boosting mechanism that iteratively refines schema extraction through targeted, adaptive queries driven by LLMs. Particularly, our method partitions logs…
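To make the notion of a log template concrete, here is a minimal Python sketch of how an already-extracted template can be checked against raw log lines. It is not SchemaCoder's method and does not implement Residual Q-Tree Boosting. The `<*>` wildcard convention, the helper names `template_to_regex` and `match_template`, and the sample log lines are all assumptions made for illustration.

```python
import re

# Assumed convention: "<*>" marks a variable field inside a template.
WILDCARD = "<*>"


def template_to_regex(template: str) -> re.Pattern:
    """Turn a template such as 'User <*> logged in' into an anchored regex."""
    # Escape the constant parts so punctuation is matched literally,
    # then join them with a lazy capture group for each wildcard.
    constant_parts = [re.escape(part) for part in template.split(WILDCARD)]
    return re.compile("^" + "(.+?)".join(constant_parts) + "$")


def match_template(template: str, line: str):
    """Return the variable values if the line fits the template, else None."""
    match = template_to_regex(template).match(line)
    return list(match.groups()) if match else None


# Hypothetical template and log lines, invented for this example.
template = "Connection from <*> closed after <*> ms"
logs = [
    "Connection from 10.0.0.5 closed after 132 ms",
    "Connection from 192.168.1.9 closed after 7 ms",
    "Disk usage at 91%",
]

for line in logs:
    print(line, "->", match_template(template, line))
```

Lines that fit the template yield their variable fields, and lines that do not fit return `None`. The hard part, which the abstract attributes to SchemaCoder, is deriving such templates automatically rather than writing them by hand.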
