DocMMIR: A Framework for Document Multi-modal Information Retrieval

Zirui Li; Siwei Wu; Yizhi Li; Xingyu Wang; Yi Zhou; Chenghua Lin

arXiv:2505.19312 · cs.IR · October 20, 2025

TL;DR

This paper introduces DocMMIR, a framework and benchmark for multi-modal document retrieval across diverse formats and domains. It shows that current models perform poorly on this task and that a tailored training approach substantially improves CLIP, with a reported +31% gain in MRR@10.

Contribution

The paper presents a new multi-modal document retrieval framework, a large-scale cross-domain benchmark of 450K samples, and a tailored training approach that improves retrieval performance.

Findings

01. Current SOTA models have limited zero-shot performance on document retrieval.

02. Training strategies significantly improve retrieval metrics, with a +31% MRR@10 gain.

03. A large-scale, diverse dataset enables systematic evaluation of multi-modal retrieval methods.
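
For readers unfamiliar with the MRR@10 metric cited in the findings, the following is a minimal sketch of how it is commonly computed. It assumes each query has exactly one relevant document, which this summary does not state; the function name `mrr_at_k` and the toy document IDs are hypothetical and are not taken from the paper.

```python
from typing import Hashable, Sequence


def mrr_at_k(
    ranked_ids_per_query: Sequence[Sequence[Hashable]],
    relevant_id_per_query: Sequence[Hashable],
    k: int = 10,
) -> float:
    """Mean reciprocal rank over queries, counting only the top-k results.

    Assumption: one relevant document per query (hypothetical setup).
    A query whose relevant document is missing from the top k scores 0.
    """
    if len(ranked_ids_per_query) != len(relevant_id_per_query):
        raise ValueError("need one relevant id per query")
    if not ranked_ids_per_query:
        return 0.0

    total = 0.0
    for ranked, relevant in zip(ranked_ids_per_query, relevant_id_per_query):
        # Ranks are 1-based: the first retrieved document has rank 1.
        for rank, doc_id in enumerate(ranked[:k], start=1):
            if doc_id == relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_ids_per_query)


# Toy example: the relevant document is ranked 1st, 2nd, and not retrieved.
rankings = [["a", "b"], ["x", "y"], ["p", "q"]]
relevant = ["a", "y", "z"]
print(mrr_at_k(rankings, relevant))  # (1 + 1/2 + 0) / 3 = 0.5
```

Under this definition the score lies between 0 and 1, and a higher value means the relevant document tends to appear nearer the top of the ranked list.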

Abstract

The rapid advancement of unsupervised representation learning and large-scale pre-trained vision-language models has significantly improved cross-modal retrieval tasks. However, existing multi-modal information retrieval (MMIR) studies lack a comprehensive exploration of document-level retrieval and suffer from the absence of cross-domain datasets at this granularity. To address this limitation, we introduce DocMMIR, a novel multi-modal document retrieval framework designed explicitly to unify diverse document formats and domains, including Wikipedia articles, scientific papers (arXiv), and presentation slides, within a comprehensive retrieval scenario. We construct a large-scale cross-domain multimodal benchmark, comprising 450K samples, which systematically integrates textual and visual information. Our comprehensive experimental analysis reveals substantial limitations in current…

Taxonomy

Methods: Contrastive Language-Image Pre-training