TL;DR
BRIDGE improves multimodal-to-text retrieval by rewriting multimodal queries with a query aligner trained via reinforcement learning (FORGE) and retrieving with a reasoning-enhanced dense retriever (LENS). It outperforms all multimodal encoder baselines on MM-BRIGHT.
Contribution
The paper introduces two components, FORGE (a query alignment model trained via reinforcement learning) and LENS (a reasoning-enhanced dense retriever), that address the query mismatch without multimodal encoders and improve retrieval performance.
Findings
BRIDGE surpasses all multimodal encoder baselines on MM-BRIGHT.
Applying FORGE as a plug-and-play aligner improves retrieval beyond text-only baselines.
Query alignment is identified as the key bottleneck in multimodal-to-text retrieval.
Abstract
Multimodal retrieval systems struggle to resolve image-text queries against text-only corpora: the best vision-language encoder achieves only 27.6 nDCG@10 on MM-BRIGHT, underperforming strong text-only retrievers. We argue the bottleneck is not the retriever but the query: raw multimodal queries entangle visual descriptions, conversational noise, and retrieval intent in ways that systematically degrade embedding similarity. We present BRIDGE, a two-component system that resolves this mismatch without multimodal encoders. FORGE (Focused Retrieval Query Generator) is a query alignment model trained via reinforcement learning, which distills noisy multimodal queries into compact, retrieval-optimized search strings. LENS (Language-Enhanced Neural Search) is a reasoning-enhanced dense retriever fine-tuned on…
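For readers unfamiliar with the metric quoted in the abstract, the following is a minimal sketch of how nDCG@10 is commonly computed for a single query. It assumes the linear-gain formulation with a log2 rank discount; the function names are illustrative, and the paper's own evaluation code may differ (for example, by using exponential gain). Benchmarks usually report the score averaged over queries and multiplied by 100, so 27.6 would correspond to a mean of 0.276.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k results (linear gain)."""
    # enumerate() is 0-based, so the result at rank r (1-based) is
    # discounted by log2(r + 1) == log2(index + 2).
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k=10, all_relevances=None):
    """nDCG@k for one query.

    ranked_relevances: relevance labels of the retrieved documents, in ranked order.
    all_relevances: labels of every judged document for the query; if omitted,
        the ideal ranking is built from the retrieved list alone (an assumption
        that overestimates the score when relevant documents were not retrieved).
    """
    pool = all_relevances if all_relevances is not None else ranked_relevances
    ideal_dcg = dcg_at_k(sorted(pool, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant documents: define the score as zero
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


if __name__ == "__main__":
    # A single relevant document retrieved at rank 2 instead of rank 1.
    print(round(ndcg_at_k([0, 1, 0]), 4))
```

Because the discount grows with rank, moving the one relevant document from rank 1 to rank 2 already lowers the score, which is why cleaner queries that lift relevant passages a few positions can change the metric noticeably.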
