Towards Real-World Document Parsing via Realistic Scene Synthesis and Document-Aware Training

Gengluo Li; Pengyuan Lyu; Chengquan Zhang; Huawen Shen; Liang Wu; Xingyu Wan; Gangyan Zeng; Han Hu; Can Ma; Yu Zhou

arXiv:2603.23885·cs.CV·April 21, 2026

TL;DR

This paper introduces a co-designed framework combining realistic scene synthesis and document-aware training to improve end-to-end document parsing robustness, especially in real-world scenarios.

Contribution

The paper proposes a data-training co-design approach that pairs large-scale synthetic full-page supervision, built with Realistic Scene Synthesis, with structure-aware training strategies for robust end-to-end document parsing.

Findings

01 Achieves superior accuracy across diverse document scenarios

02 Demonstrates robustness on real-world captured documents

03 Provides publicly available models and benchmarks

Abstract

Document parsing has recently advanced with multimodal large language models (MLLMs) that directly map document images to structured outputs. Traditional cascaded pipelines depend on precise layout analysis and often fail under casually captured or non-standard conditions. Although end-to-end approaches mitigate this dependency, they still produce repetitive, hallucinated, and structurally inconsistent predictions, primarily because large-scale, high-quality full-page (document-level) end-to-end parsing data is scarce and structure-aware training strategies are lacking. To address these challenges, we propose a data-training co-design framework for robust end-to-end document parsing. A Realistic Scene Synthesis strategy constructs large-scale, structurally diverse full-page end-to-end supervision by composing layout templates with rich document elements, while a Document-Aware…

Code & Models

Datasets

VirtualLUO/Wild_OmniDocBench
dataset · 101 downloads