MMDocBench: Benchmarking Large Vision-Language Models for Fine-Grained Visual Document Understanding
Fengbin Zhu, Ziyang Liu, Xiang Yao Ng, Haohui Wu, Wenjie Wang, Fuli Feng, Chao Wang, Huanbo Luan, Tat-Seng Chua

TL;DR
This paper introduces MMDocBench, a comprehensive benchmark for evaluating large vision-language models on fine-grained understanding of document images across various tasks and document types, addressing limitations of existing benchmarks.
Contribution
It presents MMDocBench, a new benchmark with diverse OCR-free document understanding tasks, and provides extensive evaluation of 16 LVLMs on these tasks, highlighting their strengths and weaknesses.
Findings
LVLMs show varied performance across document types and tasks.
MMDocBench defines 15 main tasks with 4,338 QA pairs and 11,353 supporting regions.
Benchmark and evaluation code will be publicly available.
Abstract
Large Vision-Language Models (LVLMs) have achieved remarkable performance in many vision-language tasks, yet their capabilities in fine-grained visual understanding remain insufficiently evaluated. Existing benchmarks either contain limited fine-grained evaluation samples that are mixed with other data, or are confined to object-level assessments in natural images. To holistically assess LVLMs' fine-grained visual understanding capabilities, we propose using document images with multi-granularity and multi-modal information to supplement natural images. In this light, we construct MMDocBench, a benchmark with various OCR-free document understanding tasks for the evaluation of fine-grained visual perception and reasoning abilities. MMDocBench defines 15 main tasks with 4,338 QA pairs and 11,353 supporting regions, covering various document images such as research papers, receipts,…
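The abstract pairs each QA item with supporting regions in the document image. As a minimal sketch of what that could look like in practice, the snippet below models one QA pair with its regions and scores a predicted box against them using intersection-over-union (IoU). The record layout (`QAPair`, `supporting_regions`), the `(x1, y1, x2, y2)` box format, and the choice of IoU as the region metric are all assumptions made for illustration; the text above does not specify the benchmark's data schema or scoring rules.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical box format: (x1, y1, x2, y2) in pixel coordinates.
Box = Tuple[float, float, float, float]


@dataclass
class QAPair:
    """Hypothetical record: one question, its answer, and the regions that support it."""
    question: str
    answer: str
    supporting_regions: List[Box]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def best_region_iou(pred: Box, sample: QAPair) -> float:
    """Highest IoU between a predicted box and any supporting region (0.0 if none)."""
    return max((iou(pred, r) for r in sample.supporting_regions), default=0.0)


# Made-up example values, not taken from the benchmark.
sample = QAPair(
    question="What is the total amount on the receipt?",
    answer="12.50",
    supporting_regions=[(100.0, 200.0, 180.0, 220.0)],
)
score = best_region_iou((100.0, 200.0, 180.0, 220.0), sample)
```

Taking the best match over all supporting regions is one simple design choice; a stricter scheme could require every region to be matched.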
Taxonomy
Topics: Multimodal Machine Learning Applications · Handwritten Text Recognition Techniques · Video Analysis and Summarization
