Reasoning-Augmented Representations for Multimodal Retrieval
Jianrui Zhang, Anirudh Sundara Rajan, Brandon Han, Soochahn Lee, Sukanta Ganguly, Yong Jae Lee

TL;DR
This paper introduces a reasoning-augmented framework for multimodal retrieval that externalizes reasoning before retrieval: it densely captions visual evidence in the corpus and rewrites queries, improving performance on complex, knowledge-intensive, and compositional queries.
Contribution
It proposes a data-centric approach that externalizes reasoning in multimodal retrieval, enhancing representations with dense captions and semantic rewriting to address query ambiguity and compositionality.
Findings
Consistent performance improvements on the M-BEIR benchmark.
Corpus enhancement benefits knowledge-intensive queries.
Query enhancement improves handling of compositional modifications.
Abstract
Universal Multimodal Retrieval (UMR) seeks any-to-any search across text and vision, yet modern embedding models remain brittle when queries require latent reasoning (e.g., resolving underspecified references or matching compositional constraints). We argue this brittleness is often data-induced: when images carry "silent" evidence and queries leave key semantics implicit, a single embedding pass must both reason and compress, encouraging spurious feature matching. We propose a data-centric framework that decouples these roles by externalizing reasoning before retrieval. Using a strong Vision-Language Model, we make implicit semantics explicit by densely captioning visual evidence in corpus entries, resolving ambiguous multimodal references in queries, and rewriting verbose instructions into concise retrieval constraints. Inference-time enhancement alone is insufficient; the retriever…
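The data flow described in the abstract (caption the corpus, rewrite the query, then retrieve) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration and is not the authors' implementation: `stub_captioner` and `stub_rewriter` are hypothetical stand-ins for the Vision-Language Model, the `image_caption` field is an invented placeholder for model output, and the bag-of-words `embed` function replaces a real embedding model. The sketch covers inference-time enhancement only, which the abstract itself says is not sufficient on its own.

```python
import math
from collections import Counter
from typing import Callable, Dict, List


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a stand-in for a real retriever encoder."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def enhance_corpus(corpus: List[Dict[str, str]],
                   captioner: Callable[[Dict[str, str]], str]) -> List[str]:
    """Make 'silent' visual evidence explicit by appending a caption to each entry."""
    return [f"{entry['text']} {captioner(entry)}".strip() for entry in corpus]


def retrieve(query: str,
             corpus: List[Dict[str, str]],
             captioner: Callable[[Dict[str, str]], str],
             rewriter: Callable[[str], str],
             k: int = 1) -> List[int]:
    """Rewrite the query, score it against the enhanced corpus, return top-k indices."""
    docs = enhance_corpus(corpus, captioner)
    q_vec = embed(rewriter(query))
    scores = [cosine(q_vec, embed(d)) for d in docs]
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Hypothetical data: the caption field stands in for what a VLM would generate.
corpus = [
    {"text": "photo", "image_caption": "a red bridge over a bay at sunset"},
    {"text": "photo", "image_caption": "a brown dog running on a beach"},
]


def stub_captioner(entry: Dict[str, str]) -> str:
    # Assumption: a real system would call a VLM on the image here.
    return entry["image_caption"]


def stub_rewriter(query: str) -> str:
    # Assumption: a real system would resolve references with a VLM; this is a lookup.
    rewrites = {"where was this animal photographed": "dog beach"}
    return rewrites.get(query, query)


top = retrieve("where was this animal photographed", corpus,
               stub_captioner, stub_rewriter)
print(top)
```

In this toy setup the raw query shares no terms with either corpus entry, so the match depends entirely on the rewritten query and the appended captions. That is the decoupling the abstract argues for: reasoning happens before the embedding step.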
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Information Retrieval and Search Behavior
