MC-Search: Evaluating and Enhancing Multimodal Agentic Search with Structured Long Reasoning Chains
Xuying Ning, Dongqi Fu, Tianxin Wei, Mengting Ai, Jiaru Zou, Ting-Wei Li, Hanghang Tong, Yada Zhu, Hendrik Hamann, Jingrui He

TL;DR
MC-Search is a novel benchmark for evaluating and improving multimodal agentic search with long, step-wise reasoning chains, addressing gaps in existing short-chain QA benchmarks and revealing systematic issues in current models.
Contribution
It introduces MC-Search, the first benchmark with long, annotated reasoning chains for multimodal agentic search, and proposes Search-Align, a fine-tuning framework to enhance model reasoning fidelity.
Findings
Benchmark reveals over- and under-retrieval issues.
Models show modality-misaligned planning problems.
Search-Align improves reasoning and retrieval fidelity.
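One way to picture the over- and under-retrieval finding is to compare how many retrieval calls an agent made with how many hops the annotated chain contains. The sketch below is only an illustration under that assumption: the function names, the labels, and the sample numbers are hypothetical, and this is not the benchmark's actual metric.

```python
from collections import Counter
from typing import Iterable, Tuple


def classify_retrieval(num_retrieval_calls: int, num_gold_hops: int) -> str:
    """Label one run by comparing retrieval calls with annotated hops.

    Assumption: one retrieval call per annotated hop is the reference
    behaviour. The source does not define the metric this way.
    """
    if num_retrieval_calls > num_gold_hops:
        return "over-retrieval"
    if num_retrieval_calls < num_gold_hops:
        return "under-retrieval"
    return "matched"


def summarize(runs: Iterable[Tuple[int, int]]) -> Counter:
    """Count labels over (num_retrieval_calls, num_gold_hops) pairs."""
    return Counter(classify_retrieval(calls, hops) for calls, hops in runs)


# Made-up runs for illustration only: (retrieval calls, gold hops).
runs = [(5, 3), (2, 4), (3, 3)]
summary = summarize(runs)
```

A count-based comparison like this ignores whether the retrieved evidence was the right evidence, so it can only flag gross mismatches in chain length.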
Abstract
With the increasing demand for step-wise, cross-modal, and knowledge-grounded reasoning, multimodal large language models (MLLMs) are evolving beyond the traditional fixed retrieve-then-generate paradigm toward more sophisticated agentic multimodal retrieval-augmented generation (MM-RAG). Existing benchmarks, however, mainly focus on simplified QA with short retrieval chains, leaving adaptive planning and multimodal reasoning underexplored. We present MC-Search, the first benchmark for agentic MM-RAG with long, step-wise annotated reasoning chains spanning five representative reasoning structures. Each example specifies sub-questions, retrieval modalities, supporting facts, and intermediate answers, with fidelity ensured by HAVE (Hop-wise Attribution and Verification of Evidence), resulting in 3,333 high-quality examples averaging 3.7 hops. Beyond answer accuracy, MC-Search introduces…
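The abstract says each example specifies sub-questions, retrieval modalities, supporting facts, and intermediate answers. A minimal sketch of how such an annotated chain could be represented follows. The class names, field names, modality labels, structure label, and sample content are all assumptions made for illustration and do not reflect the released data format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Hop:
    """One annotated reasoning step (field names are hypothetical)."""
    sub_question: str
    modality: str                 # assumed labels, e.g. "text" or "image"
    supporting_facts: List[str]
    intermediate_answer: str


@dataclass
class ChainExample:
    """One benchmark example: a question plus its step-wise chain."""
    question: str
    structure: str                # assumed label for the reasoning structure
    hops: List[Hop] = field(default_factory=list)
    final_answer: str = ""

    def num_hops(self) -> int:
        return len(self.hops)

    def modalities(self) -> List[str]:
        return [hop.modality for hop in self.hops]


def average_hops(examples: List[ChainExample]) -> float:
    """Mean chain length over a list of examples (0.0 if empty)."""
    if not examples:
        return 0.0
    return sum(e.num_hops() for e in examples) / len(examples)


# Made-up example for illustration only.
example = ChainExample(
    question="Which river flows past the landmark shown in the photo?",
    structure="linear_chain",
    hops=[
        Hop("Which landmark is shown?", "image", ["photo caption"], "Landmark A"),
        Hop("Which river flows past Landmark A?", "text", ["article passage"], "River B"),
    ],
    final_answer="River B",
)
```

Keeping the modality on each hop, rather than on the whole example, is what would let an evaluator check whether an agent planned the right retrieval type at each step.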
Peer Reviews
Decision·ICLR 2026 Oral
This paper is well-motivated. It approaches hard tasks from a new perspective: improving the retriever. A sufficient set of baseline retrievers is used to verify that the proposed method improves the retriever. This work suggests a future direction for solving more challenging tasks such as BrowseComp.
1. Previous work has shown that asking LLMs themselves to evaluate the relevance of queries and documents is not that reliable [[1]](https://arxiv.org/abs/2505.21870). Applying the Code Agent and CoT Agent also falls within this paradigm. It would be better to include experiments that verify the relationship between inconsistency (between the Code and CoT Agents) and query difficulty. 2. The initial CoT agent's decision ($y^{g}_0$) is only overturned if all L discussion groups unanimously disagree. Th
1. This submission is well prepared, especially its figures, tables, and appendix. 2. The key contribution is clear and makes sense. 3. Process metrics are genuinely needed in multi-step reasoning tasks. 4. The experiments and analysis are comprehensive and high-quality.
Several related works may help the authors make the submission more complete: 1. Evaluating robustness against harmful information is also interesting in RAG-based agentic frameworks [1]. 2. "Multi-modality" may also extend to SQL-based databases, query-rewriter-based web search, and more [2]. 3. Reporting token usage (input and output) and the number of retrieval calls would also strengthen the benchmark [3]. [1] Evaluating the Robustness of Retrieval-Augmented Generation to
The paper's primary strengths lie in its thoughtful benchmark design and its focus on process-level evaluation. Structured Benchmark Design: The introduction of five explicit reasoning topologies is a significant contribution. It moves beyond simply creating "long" chains and provides a structured way to diagnose specific model failures (e.g., a model failing on Parallel Forks but succeeding on Linear Chains). This offers a more granular analysis than existing benchmarks. Rigorous Data Curatio
The paper's contributions are undermined by significant weaknesses, primarily an overstatement of novelty and key methodological limitations. Novelty overstatement: The paper's central claim of being the "first benchmark for agentic MM-RAG with long, step-wise annotated reasoning chains" is not well-supported. Prior work such as the Dyn-VQA benchmark was specifically designed for dynamic, multi-hop questions requiring complex, adaptive retrieval and also introduced a self-adaptive planning agent.
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Speech and dialogue systems
