Improving Layout Representation Learning Across Inconsistently Annotated Datasets via Agentic Harmonization
Renyu Li, Vladimir Kirilenko, Yao You, and Crag Wolfe

TL;DR
The paper introduces an agentic label harmonization method using vision-language models to reconcile annotation inconsistencies across datasets, improving document layout detection performance.
Contribution
It presents a novel harmonization workflow that aligns heterogeneous annotations before training, enhancing model accuracy and representation quality.
Findings
Harmonization improves detection F-score from 0.860 to 0.883.
Harmonization increases table TEDS to 0.814.
Representation analysis shows more compact embeddings after harmonization.
Abstract
Fine-tuning object detection (OD) models on combined datasets assumes annotation compatibility, yet datasets often encode conflicting spatial definitions for semantically equivalent categories. We propose an agentic label harmonization workflow that uses a vision-language model to reconcile both category semantics and bounding box granularity across heterogeneous sources before training. We evaluate on document layout detection as a challenging case study, where annotation standards vary widely across corpora. Without harmonization, naïve mixed-dataset fine-tuning degrades a pretrained RT-DETRv2 detector: on SCORE-Bench, which measures how accurately the full document conversion pipeline reproduces ground-truth structure, table TEDS drops from 0.800 to 0.750. Applied to two corpora whose 16 and 10 category taxonomies share only 8 direct correspondences, harmonization yields consistent…
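To make the two reconciliation steps named in the abstract concrete (category semantics and bounding box granularity), here is a minimal Python sketch. It is not the authors' workflow: in the paper a vision-language model makes these decisions, whereas this sketch substitutes a hand-written label map and a simple geometric rule. The names `Box`, `LABEL_MAP`, `MERGEABLE`, and `harmonize`, the example category names, and the `max_gap` threshold are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    label: str
    x0: float
    y0: float
    x1: float
    y1: float


# Hypothetical mapping from source-dataset categories to a shared taxonomy.
# In the paper a vision-language model resolves this; here it is hard-coded.
LABEL_MAP = {
    "Text": "paragraph",
    "Paragraph": "paragraph",
    "Table": "table",
    "Tabular": "table",
}

# Assumption: only these target categories are coarsened by merging.
MERGEABLE = {"paragraph"}


def harmonize(boxes, label_map=LABEL_MAP, max_gap=5.0):
    """Remap labels, then merge vertically adjacent same-label boxes."""
    # Step 1: category semantics. Drop boxes with no known correspondence.
    remapped = [
        Box(label_map[b.label], b.x0, b.y0, b.x1, b.y1)
        for b in boxes
        if b.label in label_map
    ]
    # Step 2: granularity. Scan top to bottom and union boxes that share a
    # label, overlap horizontally, and sit within max_gap of each other.
    remapped.sort(key=lambda b: (b.y0, b.x0))
    merged = []
    for b in remapped:
        if merged and b.label in MERGEABLE:
            prev = merged[-1]
            same_label = prev.label == b.label
            close = b.y0 - prev.y1 <= max_gap
            overlap = min(prev.x1, b.x1) > max(prev.x0, b.x0)
            if same_label and close and overlap:
                merged[-1] = Box(
                    b.label,
                    min(prev.x0, b.x0),
                    min(prev.y0, b.y0),
                    max(prev.x1, b.x1),
                    max(prev.y1, b.y1),
                )
                continue
        merged.append(b)
    return merged


# Two line-level "Text" boxes and one "Table" box from a hypothetical source.
example = [
    Box("Text", 0, 0, 100, 10),
    Box("Text", 0, 12, 100, 22),
    Box("Table", 0, 50, 100, 90),
]
result = harmonize(example)
```

In this example the two line-level boxes become a single `paragraph` box and the table box is only relabeled. Whether a real harmonization should merge fine boxes into coarse ones, or split coarse ones, depends on the target annotation standard, which the sketch does not attempt to decide.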
