TL;DR
This paper introduces DocMMIR, a framework and benchmark for multi-modal document retrieval across diverse formats and domains. It highlights the limitations of current models and shows that a tailored training approach significantly improves CLIP's performance.
Contribution
The paper presents a new multi-modal document retrieval framework, a large-scale cross-domain benchmark, and a tailored training approach that enhances model performance.
Findings
Current state-of-the-art models show limited zero-shot performance on multi-modal document retrieval.
The tailored training strategy significantly improves CLIP's retrieval metrics, with a reported +31% gain in MRR@10.
A large-scale, diverse dataset enables systematic evaluation of multi-modal retrieval methods.
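For readers unfamiliar with the metric cited above, MRR@10 (mean reciprocal rank at cutoff 10) averages 1/rank of the correct document over all queries, counting a query as 0 when the correct document is not in the top 10. The sketch below is a minimal illustration, not the paper's evaluation code; the function name `mrr_at_k` and the assumption of exactly one relevant document per query are ours. This summary also does not say whether the +31% figure is an absolute or a relative change.

```python
from typing import Hashable, Sequence


def mrr_at_k(
    ranked_ids: Sequence[Sequence[Hashable]],
    relevant_ids: Sequence[Hashable],
    k: int = 10,
) -> float:
    """Mean reciprocal rank at cutoff k.

    ranked_ids[i] is the retrieved document list for query i, best first.
    relevant_ids[i] is the single correct document for query i
    (an assumption of this sketch).
    """
    if not relevant_ids:
        return 0.0
    total = 0.0
    for ranking, target in zip(ranked_ids, relevant_ids):
        # Only the top-k results count; a miss contributes 0.
        for rank, doc_id in enumerate(ranking[:k], start=1):
            if doc_id == target:
                total += 1.0 / rank
                break
    return total / len(relevant_ids)


# Toy example: hit at rank 1, hit at rank 2, and a miss.
rankings = [["a", "b", "c"], ["x", "y", "z"], ["p", "q"]]
targets = ["a", "y", "missing"]
score = mrr_at_k(rankings, targets)  # (1 + 1/2 + 0) / 3
```

Because the score depends only on the rank of the first correct hit, it rewards placing the right document near the top more than it rewards retrieving it somewhere in the list.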
Abstract
The rapid advancement of unsupervised representation learning and large-scale pre-trained vision-language models has significantly improved cross-modal retrieval tasks. However, existing multi-modal information retrieval (MMIR) studies lack a comprehensive exploration of document-level retrieval and suffer from the absence of cross-domain datasets at this granularity. To address this limitation, we introduce DocMMIR, a novel multi-modal document retrieval framework designed explicitly to unify diverse document formats and domains, including Wikipedia articles, scientific papers (arXiv), and presentation slides, within a comprehensive retrieval scenario. We construct a large-scale cross-domain multimodal benchmark, comprising 450K samples, which systematically integrates textual and visual information. Our comprehensive experimental analysis reveals substantial limitations in current…
Taxonomy
Methods: Contrastive Language-Image Pre-training
