BRIDGE: Multimodal-to-Text Retrieval via Reinforcement-Learned Query Alignment

Mohamed Darwish Mounis; Mohamed Mahmoud; Shaimaa Sedek; Mahmoud Abdalla; Mahmoud SalahEldin Kasem; Abdelrahman Abdallah; Hyun-Soo Kang

arXiv:2604.07201·cs.IR·April 9, 2026

TL;DR

BRIDGE improves multimodal-to-text retrieval by pairing a reinforcement-learned query aligner (FORGE) with a reasoning-enhanced dense retriever (LENS), and it outperforms multimodal encoder baselines on MM-BRIGHT.

Contribution

The paper introduces FORGE and LENS components that address query mismatch without multimodal encoders, significantly boosting retrieval performance.

Findings

01. BRIDGE surpasses all multimodal encoder baselines on MM-BRIGHT.

02. Applying FORGE as a plug-and-play aligner improves retrieval beyond text-only baselines.

03. Query alignment is identified as the key bottleneck in multimodal-to-text retrieval.

Abstract

Multimodal retrieval systems struggle to resolve image-text queries against text-only corpora: the best vision-language encoder achieves only 27.6 nDCG@10 on MM-BRIGHT, underperforming strong text-only retrievers. We argue the bottleneck is not the retriever but the query: raw multimodal queries entangle visual descriptions, conversational noise, and retrieval intent in ways that systematically degrade embedding similarity. We present BRIDGE, a two-component system that resolves this mismatch without multimodal encoders. FORGE (Focused Retrieval Query Generator) is a query alignment model trained via reinforcement learning, which distills noisy multimodal queries into compact, retrieval-optimized search strings. LENS (Language-Enhanced Neural Search) is a reasoning-enhanced dense retriever fine-tuned on…
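For readers unfamiliar with the metric quoted in the abstract, the following is a minimal sketch of how nDCG@10 is commonly computed for a single query. It is not taken from the paper or its repository: the function names, the linear-gain formulation, and the toy document IDs are assumptions made for illustration. Benchmarks typically average the per-query value over all queries and often report it multiplied by 100, which is presumably how a figure such as 27.6 arises.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the first k graded relevances.

    Uses linear gain (rel / log2(rank + 1), ranks starting at 1); some
    implementations use 2**rel - 1 instead, which is identical for
    binary labels.
    """
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_doc_ids, relevance_by_doc, k=10):
    """nDCG@k for one query.

    ranked_doc_ids: document IDs in the order the retriever returned them.
    relevance_by_doc: hypothetical relevance judgments, doc ID -> grade.
    """
    gains = [relevance_by_doc.get(doc_id, 0) for doc_id in ranked_doc_ids]
    ideal = sorted(relevance_by_doc.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    if ideal_dcg == 0:
        return 0.0  # no relevant documents judged for this query
    return dcg_at_k(gains, k) / ideal_dcg


# Toy example: two relevant documents, retrieved at ranks 2 and 3.
qrels = {"d1": 1, "d2": 1}
score = ndcg_at_k(["d3", "d1", "d2"], qrels)
print(f"nDCG@10 = {score:.4f}")
```

Because the discount grows with rank, placing relevant documents lower in the list reduces the score even when all of them appear within the top 10.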

Code & Models

Repositories

mm-bright/multimodal-reasoning-retrieval (GitHub)
