Improving Layout Representation Learning Across Inconsistently Annotated Datasets via Agentic Harmonization

Renyu Li; Vladimir Kirilenko; Yao You; and Crag Wolfe

arXiv:2604.11042·cs.CV·April 14, 2026

TL;DR

The paper introduces an agentic label harmonization method using vision-language models to reconcile annotation inconsistencies across datasets, improving document layout detection performance.

Contribution

It presents a novel harmonization workflow that aligns heterogeneous annotations before training, enhancing model accuracy and representation quality.

Findings

01 Harmonization improves detection F-score from 0.860 to 0.883.

02 Harmonization increases table TEDS to 0.814.

03 Representation analysis shows more compact embeddings after harmonization.

Abstract

Fine-tuning object detection (OD) models on combined datasets assumes annotation compatibility, yet datasets often encode conflicting spatial definitions for semantically equivalent categories. We propose an agentic label harmonization workflow that uses a vision-language model to reconcile both category semantics and bounding box granularity across heterogeneous sources before training. We evaluate on document layout detection as a challenging case study, where annotation standards vary widely across corpora. Without harmonization, naïve mixed-dataset fine-tuning degrades a pretrained RT-DETRv2 detector: on SCORE-Bench, which measures how accurately the full document conversion pipeline reproduces ground-truth structure, table TEDS drops from 0.800 to 0.750. Applied to two corpora whose 16- and 10-category taxonomies share only 8 direct correspondences, harmonization yields consistent…
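
To make the two reconciliation steps named in the abstract concrete (category semantics and bounding box granularity), here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation. The `Box` type, the label names in `LABEL_MAP`, and the helpers `remap_labels` and `merge_boxes` are all hypothetical. In the paper a vision-language model makes these decisions; here the mapping is hard-coded by hand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """One layout annotation: a label plus an axis-aligned box."""
    label: str
    x0: float
    y0: float
    x1: float
    y1: float


# Hypothetical mapping from one source taxonomy to a shared target taxonomy.
# These label names are invented for illustration only.
LABEL_MAP = {
    "Text": "paragraph",
    "Title": "heading",
    "Section-header": "heading",
    "Table": "table",
}


def remap_labels(boxes, label_map):
    """Split boxes into (mapped, unresolved).

    Boxes whose label has a direct correspondence are relabeled.
    The rest are returned untouched so a later step can decide
    what to do with them.
    """
    mapped, unresolved = [], []
    for b in boxes:
        target = label_map.get(b.label)
        if target is None:
            unresolved.append(b)
        else:
            mapped.append(Box(target, b.x0, b.y0, b.x1, b.y1))
    return mapped, unresolved


def merge_boxes(boxes, label):
    """Coarsen granularity: replace several boxes with their enclosing box."""
    if not boxes:
        raise ValueError("merge_boxes needs at least one box")
    return Box(
        label,
        min(b.x0 for b in boxes),
        min(b.y0 for b in boxes),
        max(b.x1 for b in boxes),
        max(b.y1 for b in boxes),
    )


if __name__ == "__main__":
    source = [
        Box("Text", 0, 0, 10, 2),
        Box("Text", 0, 2, 10, 4),
        Box("Footnote-mark", 1, 1, 2, 2),  # no direct correspondence
    ]
    mapped, unresolved = remap_labels(source, LABEL_MAP)
    print(merge_boxes(mapped, "paragraph"))
    print(unresolved)
```

Keeping unresolved boxes separate, rather than dropping them, reflects the situation the abstract describes: only some categories have direct correspondences, so the remaining ones need a case-by-case decision.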
