Zero-shot Multimodal Document Retrieval via Cross-modal Question Generation

Yejin Choi; Jaewoo Park; Janghan Yoon; Saejin Kim; Jaehyun Jeon; Youngjae Yu

arXiv:2508.17079 · cs.IR · August 26, 2025

TL;DR

This paper introduces PREMIR, a novel multimodal document retrieval framework that uses cross-modal question generation to improve retrieval performance across unseen domains and languages, outperforming existing methods.

Contribution

The paper presents PREMIR, a new approach leveraging cross-modal pre-questions generated by MLLMs to enhance retrieval accuracy in diverse and unseen multimodal document settings.

Findings

01 Achieves state-of-the-art performance on out-of-distribution benchmarks.

02 Outperforms strong baselines across all retrieval metrics.

03 Demonstrates robustness in real-world, multilingual, and closed-domain scenarios.

Abstract

Rapid advances in Multimodal Large Language Models (MLLMs) have expanded information retrieval beyond purely textual inputs, enabling retrieval from complex real-world documents that combine text and visuals. However, most documents are private, either owned by individuals or confined within corporate silos, and current retrievers struggle when faced with unseen domains or languages. To address this gap, we introduce PREMIR, a simple yet effective framework that leverages the broad knowledge of an MLLM to generate cross-modal pre-questions (preQs) before retrieval. Unlike earlier multimodal retrievers that compare embeddings in a single vector space, PREMIR leverages preQs from multiple complementary modalities to expand the scope of matching to the token level. Experiments show that PREMIR achieves state-of-the-art performance on out-of-distribution benchmarks, including closed-domain…
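The pre-question idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the preQs are hard-coded here instead of being generated by an MLLM, the document IDs and the helper names (`tokenize`, `build_preq_index`, `retrieve`) are hypothetical, and plain Jaccard token overlap stands in for whatever matching PREMIR actually uses. It only shows the general shape of the approach: questions are attached to each document ahead of time, a user query is matched against those questions, and the best-matching question decides the document's score.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase and split on whitespace, dropping question marks."""
    return set(text.lower().replace("?", " ").split())


def jaccard(a, b):
    """Token-set overlap; a simple stand-in for a real retriever score."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def build_preq_index(preqs_by_doc):
    """Flatten {doc_id: [preQ, ...]} into (doc_id, preQ, tokens) entries."""
    index = []
    for doc_id, questions in preqs_by_doc.items():
        for question in questions:
            index.append((doc_id, question, tokenize(question)))
    return index


def retrieve(query, index, top_k=3):
    """Score each document by its best-matching preQ and rank documents."""
    query_tokens = tokenize(query)
    best = defaultdict(float)
    for doc_id, _question, tokens in index:
        best[doc_id] = max(best[doc_id], jaccard(query_tokens, tokens))
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


# Hypothetical preQs; in the paper's setting an MLLM would produce these
# from the text and visual content of each document before retrieval.
PREQS = {
    "invoice_page": [
        "What is the total amount due on the invoice?",
        "Which company issued the invoice?",
    ],
    "chart_page": [
        "What trend does the revenue chart show?",
        "Which year has the highest revenue in the chart?",
    ],
}

INDEX = build_preq_index(PREQS)
RESULT = retrieve("what is the total amount due", INDEX)
print(RESULT)
```

Because the query is compared with questions and not with raw document embeddings, the retriever itself never needs to have seen the document's domain; in this sketch all domain knowledge sits in the pre-generated questions.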
