MMLongBench-Doc: Benchmarking Long-context Document Understanding with Visualizations
Yubo Ma, Yuhang Zang, Liangyu Chen, Meiqi Chen, Yizhu Jiao, Xinze Li, Xinyuan Lu, Ziyu Liu, Yan Ma, Xiaoyi Dong, Pan Zhang, Liangming Pan, Yu-Gang Jiang, Jiaqi Wang, Yixin Cao, Aixin Sun

TL;DR
This paper introduces MMLongBench-Doc, a comprehensive benchmark dataset for evaluating large vision-language models on long, multi-modal documents, revealing current models' significant challenges in understanding lengthy, complex documents.
Contribution
It provides the first long-context, multi-modal benchmark with expert annotations for document understanding, highlighting the limitations of existing LVLMs in handling long, multi-source, multi-page documents.
Findings
Current LVLMs perform poorly on long-context document understanding.
GPT-4o achieves an F1 score of only 42.7% on the benchmark.
Most LVLMs underperform LLMs that are given OCR-parsed text of the same documents.
Abstract
Understanding documents with rich layouts and multi-modal components is a long-standing and practical task. Recent Large Vision-Language Models (LVLMs) have made remarkable strides in various tasks, particularly in single-page document understanding (DU). However, their abilities on long-context DU remain an open problem. This work presents MMLongBench-Doc, a long-context, multi-modal benchmark comprising 1,062 expert-annotated questions. Distinct from previous datasets, it is constructed upon 130 lengthy PDF-formatted documents with an average of 49.4 pages and 20,971 textual tokens. Towards comprehensive evaluation, answers to these questions rely on pieces of evidence from (1) different sources (text, image, chart, table, and layout structure) and (2) various locations (i.e. page number). Moreover, 33.2% of the questions are cross-page questions requiring evidence across multiple…
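The abstract's notion of a cross-page question can be made concrete with a small sketch. The record layout below (`Question`, `evidence_pages`, `evidence_sources`) is a hypothetical illustration, not the benchmark's actual schema, and the three toy questions are invented for the example rather than taken from the dataset. It assumes only that each question lists the pages its evidence comes from, and treats a question as cross-page when that evidence spans more than one distinct page.

```python
from dataclasses import dataclass, field
from typing import List


# Hypothetical record layout: field names are illustrative assumptions,
# not the schema used by MMLongBench-Doc.
@dataclass
class Question:
    text: str
    evidence_pages: List[int] = field(default_factory=list)
    evidence_sources: List[str] = field(default_factory=list)  # e.g. "text", "chart"


def is_cross_page(q: Question) -> bool:
    """True when the evidence spans more than one distinct page."""
    return len(set(q.evidence_pages)) > 1


def cross_page_share(questions: List[Question]) -> float:
    """Fraction of questions whose evidence spans multiple pages."""
    if not questions:
        return 0.0
    return sum(is_cross_page(q) for q in questions) / len(questions)


# Toy data, invented for illustration only.
questions = [
    Question("toy q1", [3], ["text"]),
    Question("toy q2", [5, 12], ["chart", "table"]),
    Question("toy q3", [7, 7], ["image"]),  # two pieces of evidence, same page
]
share = cross_page_share(questions)
```

Counting distinct pages rather than evidence items keeps a question with several pieces of evidence on one page in the single-page group, which seems closest to how the abstract describes cross-page questions.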
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Semantic Web and Ontologies
