DocPTBench: Benchmarking End-to-End Photographed Document Parsing and Translation

Yongkun Du; Pinxuan Chen; Xuye Ying; Zhineng Chen

arXiv:2511.18434 · cs.CV · November 25, 2025

TL;DR

DocPTBench is a benchmark for evaluating multimodal large language models (MLLMs) on parsing and translating real-world photographed documents. It shows that current models lose substantial accuracy under practical capture conditions such as geometric distortions and photometric variations.

Contribution

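The paper introduces DocPTBench, a benchmark of over 1,300 high-resolution photographed documents with human-verified annotations for both parsing and translation, covering eight translation scenarios. It addresses a limitation of existing benchmarks such as OmniDocBench and DITrans, which are dominated by scanned or digital-born documents.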

Findings

01. MLLMs show an 18% accuracy drop in parsing and a 12% drop in translation on photographed documents.

02. Specialized document models experience a 25% decrease in performance on real-world photographed images.

03. Photographed documents pose distinct challenges, such as geometric distortions and photometric variations, to which current models are not robust.

Abstract

The advent of Multimodal Large Language Models (MLLMs) has unlocked the potential for end-to-end document parsing and translation. However, prevailing benchmarks such as OmniDocBench and DITrans are dominated by pristine scanned or digital-born documents, and thus fail to adequately represent the intricate challenges of real-world capture conditions, such as geometric distortions and photometric variations. To fill this gap, we introduce DocPTBench, a comprehensive benchmark specifically designed for Photographed Document Parsing and Translation. DocPTBench comprises over 1,300 high-resolution photographed documents from multiple domains, includes eight translation scenarios, and provides meticulously human-verified annotations for both parsing and translation. Our experiments demonstrate that transitioning from digital-born to photographed documents results in a substantial performance…

Code & Models

Datasets

topdu/DocPTBench (dataset, 204 downloads)

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Handwritten Text Recognition Techniques