TL;DR
HIVE is a plug-and-play framework that adds explicit visual reasoning to multimodal retrieval through LLM-driven hypothesis generation and verification, improving nDCG@10 on MM-BRIGHT from 27.6 to 41.7.
Contribution
Introduces HIVE, a plug-and-play LLM-based framework that iteratively refines multimodal retrieval with visual hypothesis reasoning, achieving state-of-the-art results.
Findings
HIVE improves nDCG@10 from 27.6 to 41.7 on MM-BRIGHT.
HIVE outperforms both text-only and previous multimodal models.
Visual reasoning contributes an 8.5 point gain in retrieval performance.
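For readers unfamiliar with the metric behind these numbers, here is a minimal sketch of nDCG@10. It assumes linear gain and a log2 rank discount, which is one common formulation; the summary does not state which variant MM-BRIGHT uses, and the function names are illustrative. Scores such as 41.7 correspond to this value multiplied by 100.

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the first k relevance labels."""
    # Rank 0 is discounted by log2(2) = 1, rank 1 by log2(3), and so on.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, all_relevances, k=10):
    """DCG of the system ranking divided by the DCG of an ideal ranking."""
    ideal = dcg_at_k(sorted(all_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0
```

Here `ranked_relevances` holds the relevance labels of the retrieved documents in ranked order, and `all_relevances` holds the labels of every judged document for the query.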
Abstract
Multimodal retrieval models fail on reasoning-intensive queries where images (diagrams, charts, screenshots) must be deeply integrated with text to identify relevant documents: the best multimodal model achieves only 27.6 nDCG@10 on MM-BRIGHT, underperforming even strong text-only retrievers (32.2). We introduce HIVE (Hypothesis-driven Iterative Visual Evidence Retrieval), a plug-and-play framework that injects explicit visual-text reasoning into a retriever via LLMs. HIVE operates in four stages: (1) initial retrieval over the corpus, (2) LLM-based compensatory query synthesis that explicitly articulates visual and logical gaps observed in top-k candidates, (3) secondary retrieval with the refined query, and (4) LLM verification and reranking over the union of candidates. Evaluated on the multimodal-to-text track of MM-BRIGHT (2,803…
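The four stages described in the abstract can be sketched as a small control loop. This is only an illustrative sketch, not the authors' implementation: the token-overlap retriever, the `synthesize` and `verify` callables (stand-ins for the LLM calls), the textual `image_desc`, and the toy corpus are all assumptions made for the example.

```python
from typing import Callable, Dict, List

def token_overlap_retrieve(query: str, corpus: Dict[str, str], k: int) -> List[str]:
    """Toy retriever (assumption): rank documents by shared lowercase tokens."""
    q = set(query.lower().split())
    ranked = sorted(corpus, key=lambda d: (-len(q & set(corpus[d].lower().split())), d))
    return ranked[:k]

def hive_style_search(
    query: str,
    image_desc: str,
    corpus: Dict[str, str],
    synthesize: Callable[[str, str, List[str]], str],
    verify: Callable[[str, str, str], float],
    k: int = 2,
) -> List[str]:
    # Stage 1: initial retrieval over the corpus.
    first = token_overlap_retrieve(query, corpus, k)
    # Stage 2: compensatory query articulating gaps seen in the top-k candidates.
    refined = synthesize(query, image_desc, [corpus[d] for d in first])
    # Stage 3: secondary retrieval with the refined query.
    second = token_overlap_retrieve(refined, corpus, k)
    # Stage 4: verification and reranking over the union of candidates.
    pool = list(dict.fromkeys(first + second))  # order-preserving union
    return sorted(pool, key=lambda d: -verify(query, image_desc, corpus[d]))

# Hypothetical stubs standing in for the LLM calls.
def stub_synthesize(query: str, image_desc: str, candidates: List[str]) -> str:
    seen = {tok for doc in candidates for tok in doc.lower().split()}
    return " ".join(t for t in image_desc.lower().split() if t not in seen)

def stub_verify(query: str, image_desc: str, doc: str) -> float:
    return 1.0 if "logarithmic" in doc else 0.0

corpus = {
    "d1": "bar chart basics",
    "d2": "reading axis labels on a chart",
    "d3": "logarithmic scale distorts bar heights",
    "d4": "cooking pasta",
}
ranked = hive_style_search(
    "why does my chart look wrong",
    "bar chart with logarithmic y axis",
    corpus, stub_synthesize, stub_verify,
)
```

In this toy run the first pass misses `d3` because the text query never mentions the logarithmic axis; the compensatory query surfaces it in the second pass, and the verification step moves it to the top of the merged pool.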
