WildDoc: How Far Are We from Achieving Comprehensive and Robust Document Understanding in the Wild?

An-Lan Wang; Jingqun Tang; Liao Lei; Hao Feng; Qi Liu; Xiang Fei; Jinghui Lu; Han Wang; Weiwei Liu; Hao Liu; Yuliang Liu; Xiang Bai; Can Huang

arXiv:2505.11015·cs.CV·May 28, 2025

TL;DR

WildDoc introduces a new benchmark for evaluating document understanding in real-world conditions, revealing significant robustness gaps in current multimodal models when faced with natural environment challenges.

Contribution

This paper presents WildDoc, the first benchmark specifically designed to assess the robustness of document understanding models in natural, real-world scenarios.

Findings

1. State-of-the-art MLLMs show substantial performance drops on WildDoc.
2. Current models lack robustness to real-world document distortions.
3. WildDoc highlights the gap between digital and real-world document understanding.

Abstract

The rapid advancements in Multimodal Large Language Models (MLLMs) have significantly enhanced capabilities in Document Understanding. However, prevailing benchmarks like DocVQA and ChartQA predominantly comprise scanned or digital documents, inadequately reflecting the intricate challenges posed by diverse real-world scenarios, such as variable illumination and physical distortions. This paper introduces WildDoc, the inaugural benchmark designed specifically for assessing document understanding in natural environments. WildDoc incorporates a diverse set of manually captured document images reflecting real-world conditions and leverages document sources from established benchmarks to facilitate comprehensive comparisons with digital or scanned documents. Further, to rigorously evaluate model robustness, each document is captured four times under different conditions.…
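The four-captures-per-document design suggests a simple way to separate average accuracy from robustness. The following is a minimal sketch under stated assumptions, not the paper's evaluation code: the record layout `(doc_id, question_id, capture_id, is_correct)` and the function name `robustness_summary` are hypothetical, and the "all captures correct" rate is one plausible robustness measure rather than a metric confirmed by the text above.

```python
from collections import defaultdict


def robustness_summary(records, captures_per_doc=4):
    """Summarize per-capture accuracy and cross-capture consistency.

    `records` is an iterable of (doc_id, question_id, capture_id, is_correct)
    tuples. This layout is a hypothetical format chosen for illustration.
    """
    # Group the outcomes of each question by the capture they came from.
    groups = defaultdict(dict)
    for doc_id, question_id, capture_id, is_correct in records:
        groups[(doc_id, question_id)][capture_id] = bool(is_correct)

    if not groups:
        raise ValueError("no records supplied")

    total_answers = 0
    correct_answers = 0
    fully_correct_questions = 0
    for key, outcomes in groups.items():
        if len(outcomes) != captures_per_doc:
            raise ValueError(
                f"{key} has {len(outcomes)} captures, expected {captures_per_doc}"
            )
        total_answers += len(outcomes)
        correct_answers += sum(outcomes.values())
        # A question counts as robustly answered only if every capture is right.
        if all(outcomes.values()):
            fully_correct_questions += 1

    return {
        "average_accuracy": correct_answers / total_answers,
        "all_captures_correct_rate": fully_correct_questions / len(groups),
    }


if __name__ == "__main__":
    # Toy data: q1 is right under all four captures, q2 fails under one.
    toy = [("doc1", "q1", c, True) for c in range(4)]
    toy += [("doc1", "q2", c, c != 3) for c in range(4)]
    print(robustness_summary(toy))
```

A model can score well on average accuracy while scoring much lower on the all-captures rate, which is the kind of gap a repeated-capture benchmark is designed to expose.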

Code & Models

Datasets

ByteDance/WildDoc
dataset · 242 downloads

Videos

WildDoc: How Far Are We from Achieving Comprehensive and Robust Document Understanding in the Wild?

Taxonomy

Topics: Handwritten Text Recognition Techniques · Topic Modeling · Multimodal Machine Learning Applications

Methods: Sparse Evolutionary Training