WildDoc: How Far Are We from Achieving Comprehensive and Robust Document Understanding in the Wild?
An-Lan Wang, Jingqun Tang, Liao Lei, Hao Feng, Qi Liu, Xiang Fei, Jinghui Lu, Han Wang, Weiwei Liu, Hao Liu, Yuliang Liu, Xiang Bai, Can Huang

TL;DR
WildDoc introduces a new benchmark for evaluating document understanding in real-world conditions, revealing significant robustness gaps in current multimodal models when faced with natural environment challenges.
Contribution
This paper presents WildDoc, the first benchmark specifically designed to assess the robustness of document understanding models in natural, real-world scenarios.
Findings
State-of-the-art MLLMs show substantial performance drops on WildDoc.
Current models lack robustness to real-world document distortions.
WildDoc highlights the gap between digital and real-world document understanding.
Abstract
The rapid advancements in Multimodal Large Language Models (MLLMs) have significantly enhanced capabilities in Document Understanding. However, prevailing benchmarks like DocVQA and ChartQA predominantly comprise \textit{scanned or digital} documents, inadequately reflecting the intricate challenges posed by diverse real-world scenarios, such as variable illumination and physical distortions. This paper introduces WildDoc, the inaugural benchmark designed specifically for assessing document understanding in natural environments. WildDoc incorporates a diverse set of manually captured document images reflecting real-world conditions and leverages document sources from established benchmarks to facilitate comprehensive comparisons with digital or scanned documents. Further, to rigorously evaluate model robustness, each document is captured four times under different conditions.…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
Taxonomy
TopicsHandwritten Text Recognition Techniques · Topic Modeling · Multimodal Machine Learning Applications
MethodsSparse Evolutionary Training
