How Far Is Document Parsing from Solved? PureDocBench: A Source-Traceable Benchmark across Clean, Degraded, and Real-World Settings

Zhiheng Li; Zongyang Ma; Jiaxian Chen; Jianing Zhang; Zhaolong Su; Yutong Zhang; Zhiyin Yu; Ruiqi Liu; Xiaolei Lv; Bo Li; Jun Gao; Ziqi Zhang; Chunfeng Yuan; Bing Li; Weiming Hu

arXiv:2605.07492·cs.CV·May 11, 2026

TL;DR

PureDocBench introduces a comprehensive, source-traceable benchmark for document parsing that reveals the field is still far from solved, with significant performance gaps and robustness issues across models.

Contribution

The paper presents PureDocBench, a new benchmark with verifiable annotations across multiple domains and degradation types, addressing limitations of existing datasets and enabling more reliable evaluation.

Findings

01

The best model scores only around 74 out of 100, indicating parsing is still unsolved.

02

Specialist parsers with fewer parameters outperform larger general VLMs in many cases.

03

Degradation affects model families differently: general VLMs are more robust to it than pipeline specialists.

Abstract

The past year has seen over 20 open-source document parsing models, yet the field still benchmarks almost exclusively on OmniDocBench, a 1,355-page manually annotated dataset whose top scores have saturated above 90%. A three-stage audit pipeline we run on OmniDocBench screens its 21,353 evaluator-scored blocks and confirms 2,580 errors (12.08%); combined with over a year of public availability, both annotation quality and contamination risk call its rankings into question. To address these issues, we present PureDocBench, a programmatically generated, source-traceable benchmark that renders document images from HTML/CSS and produces verifiable annotations from the same source, covering 10 domains, 66 subcategories, and 1,475 pages, each in three versions: clean, digitally degraded, and real-degraded (4,425 images total). Evaluating 40 models spanning pipeline specialists, end-to-end specialists,…
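The two derived figures in the abstract (the 12.08% error rate and the 4,425-image total) follow directly from the counts it quotes. A minimal arithmetic check, where the variable names are illustrative and not taken from the paper or its code:

```python
# Counts quoted in the abstract; variable names are illustrative assumptions.
audited_blocks = 21_353      # evaluator-scored blocks screened in OmniDocBench
confirmed_errors = 2_580     # errors confirmed by the three-stage audit
pages = 1_475                # PureDocBench pages
versions_per_page = 3        # clean, digitally degraded, real-degraded

# Share of audited blocks with a confirmed error, as a percentage.
error_rate_pct = round(100 * confirmed_errors / audited_blocks, 2)

# Every page is rendered once per version.
total_images = pages * versions_per_page

print(f"confirmed error rate: {error_rate_pct}%")
print(f"total images: {total_images}")
```

Both results match the abstract's stated values, so the quoted counts are internally consistent.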

Code & Models

Datasets

zhihengli-casia/puredocbench
dataset· 627 dl
