DocPTBench: Benchmarking End-to-End Photographed Document Parsing and Translation
Yongkun Du, Pinxuan Chen, Xuye Ying, Zhineng Chen

TL;DR
DocPTBench is a new benchmark dataset for evaluating multimodal large language models on real-world photographed documents, highlighting significant performance challenges and gaps in current models' robustness under practical capture conditions.
Contribution
The paper introduces a human-annotated benchmark of over 1,300 photographed documents for parsing and translation, addressing the limitation that existing benchmarks focus on scanned or digital-born documents.
Findings
MLLMs show an 18% accuracy drop in parsing and a 12% drop in translation when moving from digital-born to photographed documents.
Specialized document models experience a 25% decrease in performance on photographed images.
Current models are not robust to the capture conditions of photographed documents, such as geometric distortions and photometric variations.
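To make the percentages above concrete, here is a minimal sketch of how a performance drop between a digital-born set and a photographed set could be computed. This summary does not say whether the reported drops are absolute (score points) or relative (a fraction of the digital-born score), so the sketch returns both. The function name `accuracy_drop` and the example scores are hypothetical and are not taken from the paper.

```python
def accuracy_drop(digital_score: float, photographed_score: float) -> dict:
    """Compare a model's score on digital-born vs. photographed documents.

    Returns the absolute drop (in score points) and the relative drop
    (as a fraction of the digital-born score).
    """
    if digital_score <= 0:
        raise ValueError("digital_score must be positive")
    absolute = digital_score - photographed_score
    relative = absolute / digital_score
    return {"absolute": absolute, "relative": relative}


# Hypothetical scores for illustration only (not results from the paper).
drop = accuracy_drop(digital_score=80.0, photographed_score=60.0)
print(f"absolute drop: {drop['absolute']:.1f} points")
print(f"relative drop: {drop['relative']:.0%}")
```

The two conventions can differ noticeably for the same pair of scores, so it is worth checking which one a benchmark uses before comparing numbers across papers.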
Abstract
The advent of Multimodal Large Language Models (MLLMs) has unlocked the potential for end-to-end document parsing and translation. However, prevailing benchmarks such as OmniDocBench and DITrans are dominated by pristine scanned or digital-born documents, and thus fail to adequately represent the intricate challenges of real-world capture conditions, such as geometric distortions and photometric variations. To fill this gap, we introduce DocPTBench, a comprehensive benchmark specifically designed for Photographed Document Parsing and Translation. DocPTBench comprises over 1,300 high-resolution photographed documents from multiple domains, includes eight translation scenarios, and provides meticulously human-verified annotations for both parsing and translation. Our experiments demonstrate that transitioning from digital-born to photographed documents results in a substantial performance…
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Handwritten Text Recognition Techniques
