HIVE: Query, Hypothesize, Verify An LLM Framework for Multimodal Reasoning-Intensive Retrieval

Mahmoud Abdalla; Mahmoud SalahEldin Kasem; Mohamed Mahmoud; Mostafa Farouk Senussi; Abdelrahman Abdallah; Hyun-Soo Kang

arXiv:2604.07220·cs.IR·April 9, 2026

TL;DR

HIVE is a framework that adds explicit visual reasoning to multimodal retrieval through LLM-driven hypothesis generation and verification, raising nDCG@10 on MM-BRIGHT from 27.6 to 41.7.

Contribution

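Introduces HIVE, a plug-and-play LLM-based framework that iteratively refines multimodal retrieval with visual hypothesis reasoning: an initial retrieval pass, an LLM-synthesized compensatory query, a second retrieval pass, and LLM verification and reranking of the combined candidates.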

Findings

01. HIVE raises nDCG@10 on MM-BRIGHT from 27.6, the best prior multimodal score, to 41.7.

02. HIVE outperforms both strong text-only retrievers (32.2 nDCG@10) and previous multimodal models (27.6).

03. Visual reasoning contributes an 8.5-point gain in retrieval performance.
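For readers unfamiliar with the metric behind these numbers, the following is a minimal sketch of nDCG@10 for a single query. It assumes the common linear-gain form of DCG and assumes the input list holds the relevance labels of all judged documents in ranked order; the function names are illustrative and are not taken from the paper or its repository. Benchmark scores such as 41.7 are typically this value averaged over queries and multiplied by 100.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the first k ranked items (linear gain)."""
    # Position 0 is discounted by log2(2) = 1, position 1 by log2(3), and so on.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """nDCG@k: DCG of the given ranking divided by DCG of the ideal ordering.

    Assumption: `relevances` lists the labels of every judged document for the
    query, in the order the system ranked them.
    """
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0
```

As a quick check, a ranking that places the only relevant document second, `ndcg_at_k([0, 1])`, scores 1/log2(3), roughly 0.63, while placing it first scores 1.0.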

Abstract

Multimodal retrieval models fail on reasoning-intensive queries where images (diagrams, charts, screenshots) must be deeply integrated with text to identify relevant documents: the best multimodal model achieves only 27.6 nDCG@10 on MM-BRIGHT, underperforming even strong text-only retrievers (32.2). We introduce HIVE (Hypothesis-driven Iterative Visual Evidence Retrieval), a plug-and-play framework that injects explicit visual-text reasoning into a retriever via LLMs. HIVE operates in four stages: (1) initial retrieval over the corpus, (2) LLM-based compensatory query synthesis that explicitly articulates visual and logical gaps observed in top-k candidates, (3) secondary retrieval with the refined query, and (4) LLM verification and reranking over the union of candidates. Evaluated on the multimodal-to-text track of MM-BRIGHT (2,803…
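The four stages described in the abstract can be sketched as a small control-flow skeleton. This is an illustrative sketch only, not the authors' implementation: the retriever, query synthesizer, and verifier are passed in as callables, and every name here (`hive_style_retrieve`, `toy_retrieve`, `toy_synthesize`, `toy_verify`, `CORPUS`) is hypothetical. The toy stand-ins use word overlap and a keyword check where a real system would call a retriever and an LLM, and scoring the verifier against the original query is an assumption.

```python
from typing import Callable, Dict, List

Retriever = Callable[[str, int], List[str]]      # (query, k) -> ranked doc ids
Synthesizer = Callable[[str, List[str]], str]    # (query, candidates) -> refined query
Verifier = Callable[[str, str], float]           # (query, doc id) -> relevance score


def hive_style_retrieve(query: str, retrieve: Retriever, synthesize: Synthesizer,
                        verify: Verifier, k: int = 10) -> List[str]:
    first = retrieve(query, k)                    # Stage 1: initial retrieval
    refined = synthesize(query, first)            # Stage 2: compensatory query synthesis
    second = retrieve(refined, k)                 # Stage 3: secondary retrieval
    union = list(dict.fromkeys(first + second))   # Union of candidates, order-preserving
    # Stage 4: verification and reranking (assumption: scored against the original query)
    return sorted(union, key=lambda d: verify(query, d), reverse=True)[:k]


# Hypothetical toy stand-ins so the skeleton runs end to end.
CORPUS: Dict[str, str] = {
    "d1": "bar chart of monthly sales",
    "d2": "stacked bar chart axis scaling error",
    "d3": "fixing log axis scaling in plots",
}


def toy_retrieve(query: str, k: int) -> List[str]:
    """Rank documents by word overlap with the query, dropping zero-overlap docs."""
    words = set(query.lower().split())
    scored = [(len(words & set(text.split())), doc_id) for doc_id, text in CORPUS.items()]
    ranked = sorted(scored, key=lambda s: (-s[0], s[1]))
    return [doc_id for score, doc_id in ranked if score > 0][:k]


def toy_synthesize(query: str, candidates: List[str]) -> str:
    """Stand-in for the LLM step: append terms the first-pass candidates missed."""
    return query + " log axis scaling"


def toy_verify(query: str, doc_id: str) -> float:
    """Stand-in for LLM verification: a keyword check instead of a model call."""
    return 1.0 if "log" in CORPUS[doc_id] else 0.0


result = hive_style_retrieve("bar chart looks wrong", toy_retrieve,
                             toy_synthesize, toy_verify, k=2)
```

In this toy run the first pass never surfaces `d3`, because the original query shares no words with it; the refined query retrieves it in stage 3, and the verifier then moves it to the top of the union. Passing the three components as callables mirrors the plug-and-play framing: the base retriever can be swapped without touching the loop.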

Code & Models

Repositories

mm-bright/multimodal-reasoning-retrieval (GitHub)