MMLongBench-Doc: Benchmarking Long-context Document Understanding with Visualizations

Yubo Ma; Yuhang Zang; Liangyu Chen; Meiqi Chen; Yizhu Jiao; Xinze Li; Xinyuan Lu; Ziyu Liu; Yan Ma; Xiaoyi Dong; Pan Zhang; Liangming Pan; Yu-Gang Jiang; Jiaqi Wang; Yixin Cao; Aixin Sun

arXiv:2407.01523·cs.CV·November 13, 2024·1 cites

TL;DR

This paper introduces MMLongBench-Doc, a benchmark for evaluating large vision-language models (LVLMs) on long, multi-modal documents. Results on the benchmark show that current models struggle to understand lengthy, complex documents.

Contribution

It provides the first long-context, multi-modal benchmark with expert annotations for document understanding, highlighting the limitations of existing LVLMs in handling long, multi-source, multi-page documents.

Findings

01. Current LVLMs perform poorly on long-context document understanding.

02. GPT-4o achieves an F1 score of only 42.7% on the benchmark.

03. Most LVLMs underperform LLMs that are given OCR-parsed text of the same documents.

Abstract

Understanding documents with rich layouts and multi-modal components is a long-standing and practical task. Recent Large Vision-Language Models (LVLMs) have made remarkable strides in various tasks, particularly in single-page document understanding (DU). However, their abilities on long-context DU remain an open problem. This work presents MMLongBench-Doc, a long-context, multi-modal benchmark comprising 1,062 expert-annotated questions. Distinct from previous datasets, it is constructed upon 130 lengthy PDF-formatted documents with an average of 49.4 pages and 20,971 textual tokens. Towards comprehensive evaluation, answers to these questions rely on pieces of evidence from (1) different sources (text, image, chart, table, and layout structure) and (2) various locations (i.e. page number). Moreover, 33.2% of the questions are cross-page questions requiring evidence across multiple…
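As a rough illustration of the annotation scheme the abstract describes, the sketch below models a question whose answer depends on evidence from one or more sources and pages, and computes the share of cross-page questions. The record layout, field names, and helper function are hypothetical and chosen for illustration; they are not the dataset's actual schema.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple

# Evidence source types named in the abstract ("layout" stands for layout structure).
EVIDENCE_SOURCES = frozenset({"text", "image", "chart", "table", "layout"})


@dataclass(frozen=True)
class Question:
    """Hypothetical record for one annotated question (not the real schema)."""
    doc_id: str
    question: str
    answer: str
    evidence_sources: FrozenSet[str]   # which source types hold the evidence
    evidence_pages: Tuple[int, ...]    # page numbers where the evidence appears

    def __post_init__(self) -> None:
        # Reject source labels outside the five types listed in the abstract.
        unknown = self.evidence_sources - EVIDENCE_SOURCES
        if unknown:
            raise ValueError(f"unknown evidence sources: {sorted(unknown)}")

    def is_cross_page(self) -> bool:
        # A question is cross-page when its evidence spans more than one distinct page.
        return len(set(self.evidence_pages)) > 1


def cross_page_fraction(questions: List[Question]) -> float:
    """Share of questions whose evidence spans multiple pages (0.0 for an empty list)."""
    if not questions:
        return 0.0
    return sum(q.is_cross_page() for q in questions) / len(questions)


# Toy records, made up for this example only.
toy = [
    Question("doc-a", "What is the total?", "12", frozenset({"table"}), (4,)),
    Question("doc-a", "Which trend reverses?", "sales", frozenset({"chart", "text"}), (2, 9)),
]
print(cross_page_fraction(toy))
```

Applied to the full benchmark under this assumed layout, such a function would be one way to reproduce a statistic like the reported 33.2% share of cross-page questions.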

Code & Models

Repositories

mayubo2333/mmlongbench-doc · PyTorch

Videos

MMLongBench-Doc: Benchmarking Long-context Document Understanding with Visualizations · SlidesLive

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Semantic Web and Ontologies