Real5-OmniDocBench: A Full-Scale Physical Reconstruction Benchmark for Robust Document Parsing in the Wild
Changda Zhou, Ziyue Gao, Xueqing Wang, Tingquan Gao, Cheng Cui, Jing Tang, Yi Liu

TL;DR
This paper introduces Real5-OmniDocBench, a comprehensive physical benchmark for evaluating and diagnosing the robustness of document parsing models in real-world scenarios, revealing significant gaps in current model performance.
Contribution
It presents the first full-scale physical reconstruction benchmark for OmniDocBench, enabling detailed analysis of factors affecting document parsing robustness in real-world conditions.
Findings
Models perform significantly worse in real-world scenarios than on digital benchmarks.
The benchmark allows precise attribution of failure causes to geometric or optical distortions.
The reality gap in document parsing remains substantial, highlighting the need for more resilient models.
Abstract
While Vision-Language Models (VLMs) achieve near-perfect scores on digital document benchmarks like OmniDocBench, their performance in the unpredictable physical world remains largely unknown due to the lack of controlled yet realistic evaluations. We introduce Real5-OmniDocBench, the first benchmark that performs a full-scale, one-to-one physical reconstruction of the entire OmniDocBench v1.5 (1,355 images) across five critical real-world scenarios: Scanning, Warping, Screen-Photography, Illumination, and Skew. Unlike prior benchmarks that either lack digital correspondence or employ partial sampling, our complete ground-truth mapping enables, for the first time, rigorous factor-wise attribution of performance degradation, allowing us to pinpoint whether failures stem from geometric distortions, optical artifacts, or model limitations. Our benchmark establishes a challenging new standard…
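The one-to-one mapping described in the abstract means every physical capture can be paired with its digital original, so a per-scenario score drop can be computed image by image. The following is a minimal sketch of that idea, not the authors' evaluation code: the function name `reality_gap`, the dictionary layout, the image IDs, and every score shown are hypothetical placeholders chosen only for illustration.

```python
from statistics import mean

def reality_gap(digital, physical):
    """Mean per-image score drop for each physical scenario.

    digital:  {image_id: score} on the original digital pages (assumed layout).
    physical: {scenario: {image_id: score}} on the physical reconstructions.
    Returns {scenario: mean(digital - physical)}, or None if nothing pairs up.
    """
    gaps = {}
    for scenario, scores in physical.items():
        # Pair each physical capture with its digital original by image ID.
        drops = [digital[img] - s for img, s in scores.items() if img in digital]
        gaps[scenario] = mean(drops) if drops else None
    return gaps

# Placeholder numbers for illustration only; not results from the paper.
digital_scores = {"page_001": 90, "page_002": 80}
physical_scores = {
    "skew": {"page_001": 70, "page_002": 70},
    "warping": {"page_001": 60, "page_002": 50},
}

for scenario, gap in reality_gap(digital_scores, physical_scores).items():
    print(f"{scenario}: mean drop = {gap}")
```

Because the drop is computed on matched pairs rather than on two unrelated image sets, differences between scenarios can be read as the effect of the capture condition itself, which is the kind of factor-wise attribution the abstract refers to.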
Taxonomy
Topics: Handwritten Text Recognition Techniques · Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications
