MMDocIR: Benchmarking Multimodal Retrieval for Long Documents

Kuicai Dong; Yujing Chang; Xin Deik Goh; Dexun Li; Ruiming Tang; Yong Liu

arXiv:2501.08828 · cs.IR · November 10, 2025

Open Access · 1 Model · 3 Datasets · 1 Video

TL;DR

This paper introduces MMDocIR, a comprehensive benchmark for evaluating multimodal retrieval in long documents, covering page and layout levels, with a rich dataset and experiments showing the superiority of visual and multimodal methods.

Contribution

The work presents the first large-scale benchmark dataset for multimodal long document retrieval, including two tasks and extensive annotations, enabling better evaluation and development of retrieval systems.

Findings

01. Visual retrievers outperform text-based ones.
02. Training with MMDocIR improves retrieval performance.
03. VLM-text based retrievers outperform OCR-text based methods.

Abstract

Multimodal document retrieval aims to identify and retrieve various forms of multimodal content, such as figures, tables, charts, and layout information, from extensive documents. Despite its increasing popularity, there is a notable lack of a comprehensive and robust benchmark to effectively evaluate the performance of systems in such tasks. To address this gap, this work introduces a new benchmark, named MMDocIR, that encompasses two distinct tasks: page-level and layout-level retrieval. The former evaluates the performance of identifying the most relevant pages within a long document, while the latter assesses the ability to detect specific layouts, providing a more fine-grained measure than whole-page analysis. A layout refers to a variety of elements, including textual paragraphs, equations, figures, tables, or charts. The MMDocIR benchmark comprises a rich dataset featuring 1,685…
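To make the page-level task concrete, here is a minimal sketch of ranking the pages of one document against a query and scoring the result. It is an illustration only: the abstract does not say how MMDocIR retrievers score pages or which metric is reported, so the embedding vectors, the cosine-similarity scoring, the recall@k metric, and the function names `rank_pages` and `recall_at_k` are all assumptions made for this example. Layout-level retrieval could be sketched the same way with layout elements (paragraphs, tables, figures) as the candidates instead of whole pages.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_pages(query_vec, page_vecs, k):
    """Return the indices of the k pages most similar to the query.

    Hypothetical helper: assumes each page has already been embedded
    into the same vector space as the query.
    """
    scored = [(cosine(query_vec, vec), idx) for idx, vec in enumerate(page_vecs)]
    # Highest similarity first; ties broken by page order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:k]]


def recall_at_k(retrieved, relevant):
    """Fraction of the relevant pages that appear among the retrieved ones."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved) & set(relevant))
    return hits / len(relevant)


# Toy data, invented for illustration: one query and a four-page document.
query = [1.0, 0.0, 0.0]
pages = [
    [0.0, 1.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.5, 0.5, 0.0],
    [0.0, 0.0, 1.0],
]

top2 = rank_pages(query, pages, k=2)
print(top2)
print(recall_at_k(top2, relevant=[1]))
```

With real retrievers the vectors would come from a text or vision encoder rather than being written by hand; the ranking and scoring steps stay the same.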

Code & Models

Models

🤗 MMDocIR/MMDocIR_Retrievers (model)

Datasets

Videos

MMDocIR: Benchmarking Multimodal Retrieval for Long Documents · underline

Taxonomy

Topics: Semantic Web and Ontologies

Methods: Sparse Evolutionary Training