TL;DR
MARVEL is a unified multimodal retrieval framework that significantly improves reasoning-intensive retrieval performance by combining query expansion, a reasoning-enhanced retriever, and step-by-step reranking.
Contribution
It introduces a novel integrated pipeline that combines LLM-driven query expansion, a reasoning-enhanced retriever, and chain-of-thought reranking, surpassing existing multimodal retrieval methods.
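The three stages named above can be sketched as a single function. This is a minimal illustration of a generic expand-retrieve-rerank flow, not MARVEL's actual implementation. Every name here (`expand_retrieve_rerank`, `toy_expand`, `token_overlap`, `k_retrieve`, `k_final`) is hypothetical, and the token-overlap scorer is only a stand-in for the LLM expander, dense retriever, and reasoning reranker that the paper describes. Reranking against the original query rather than the expanded one is also an assumption.

```python
from typing import Callable, List, Sequence

# Hypothetical type aliases: any callables with these shapes can be plugged in.
Expander = Callable[[str], str]        # query -> expanded query
Scorer = Callable[[str, str], float]   # (query, document) -> relevance score


def expand_retrieve_rerank(
    query: str,
    corpus: Sequence[str],
    expand: Expander,
    retrieve_score: Scorer,
    rerank_score: Scorer,
    k_retrieve: int = 100,
    k_final: int = 10,
) -> List[int]:
    """Return indices of the top documents after the three stages."""
    # Stage 1: expand the query to surface its latent intent.
    expanded = expand(query)

    # Stage 2: first-stage retrieval with the expanded query.
    candidates = sorted(
        range(len(corpus)),
        key=lambda i: retrieve_score(expanded, corpus[i]),
        reverse=True,
    )[:k_retrieve]

    # Stage 3: rerank only the retrieved candidates.
    # Assumption: the reranker sees the original query, not the expansion.
    reranked = sorted(
        candidates,
        key=lambda i: rerank_score(query, corpus[i]),
        reverse=True,
    )
    return reranked[:k_final]


def toy_expand(query: str) -> str:
    """Stand-in for an LLM expander: appends fixed related terms."""
    return query + " neural network training"


def token_overlap(query: str, doc: str) -> float:
    """Stand-in scorer: fraction of query tokens that appear in the document."""
    q_tokens = set(query.lower().split())
    d_tokens = set(doc.lower().split())
    return len(q_tokens & d_tokens) / (len(q_tokens) or 1)
```

With a real system, `expand`, `retrieve_score`, and `rerank_score` would wrap model calls. Keeping them as plain callables makes it easy to ablate one stage at a time.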
Findings
Achieves 37.9 nDCG@10 on MM-BRIGHT, outperforming the previous best multimodal result (27.6, from the strongest vision-language encoder) by 10.3 points.
Outperforms all baselines in 27 of 29 domains, matching the best in two.
Demonstrates the effectiveness of a unified expand-retrieve-rerank framework for multimodal retrieval.
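For readers unfamiliar with the metric behind these numbers, here is a minimal sketch of nDCG@10 for one query. It assumes the linear-gain form of DCG (some evaluations use an exponential gain of 2^rel - 1; the two agree for binary labels). The function names are illustrative and do not come from the benchmark's evaluation code. Scores such as 37.9 are presumably this value averaged over queries and multiplied by 100.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """Discounted cumulative gain over the first k results (linear gain)."""
    return sum(
        rel / math.log2(rank + 2)  # rank is 0-based, so the top hit is divided by log2(2) = 1
        for rank, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(
    ranked_relevances: Sequence[float],
    all_relevances: Sequence[float],
    k: int = 10,
) -> float:
    """nDCG@k: DCG of the system ranking divided by DCG of the ideal ranking.

    ranked_relevances: labels of the returned documents, in ranked order.
    all_relevances: labels of every judged document for the query.
    """
    ideal = dcg_at_k(sorted(all_relevances, reverse=True), k)
    if ideal == 0:
        return 0.0  # no relevant documents exist for this query
    return dcg_at_k(ranked_relevances, k) / ideal
```

For example, a ranking with relevant documents at positions 1 and 3, where two relevant documents exist in total, scores below 1.0 because the second relevant document is discounted more heavily than it would be at position 2.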
Abstract
Multimodal retrieval over text corpora remains a fundamental challenge: the best vision-language encoder achieves only 27.6 nDCG@10 on MM-BRIGHT, a reasoning-intensive multimodal retrieval benchmark, underperforming strong text-only systems. We argue that effective multimodal retrieval requires three tightly integrated capabilities that existing approaches address only in isolation: expanding the query's latent intent, retrieving with a model trained for complex reasoning, and reranking via explicit step-by-step reasoning over candidates. We introduce MARVEL (Multimodal Adaptive Reasoning-intensiVe Expand-rerank and retrievaL), a unified pipeline that combines LLM-driven query expansion, MARVEL-Retriever (a reasoning-enhanced dense retriever fine-tuned for complex multimodal queries), and GPT-4o-based…
