M$^3$Searcher: Modular Multimodal Information Seeking Agency with Retrieval-Oriented Reasoning
Xiaohan Yu, Chao Feng, Lang Mei, Chong Chen

TL;DR
M$^3$Searcher is a modular multimodal information-seeking agent that improves retrieval and reasoning on complex tasks by decoupling information acquisition from answer derivation and training with a retrieval-oriented multi-objective reward.
Contribution
It introduces a novel modular architecture for multimodal search, along with a new dataset and training method to enhance reasoning and retrieval fidelity.
Findings
Outperforms existing multimodal search approaches
Shows strong transferability to new tasks
Demonstrates effective reasoning in complex multimodal scenarios
Abstract
Recent advances in DeepResearch-style agents have demonstrated strong capabilities in autonomous information acquisition and synthesis from real-world web environments. However, existing approaches remain fundamentally limited to the text modality. Extending autonomous information-seeking agents to multimodal settings introduces critical challenges: the specialization-generalization trade-off that emerges when training models for multimodal tool-use at scale, and the severe scarcity of training data capturing complex, multi-step multimodal search trajectories. To address these challenges, we propose M$^3$Searcher, a modular multimodal information-seeking agent that explicitly decouples information acquisition from answer derivation. M$^3$Searcher is optimized with a retrieval-oriented multi-objective reward that jointly encourages factual accuracy, reasoning soundness, and retrieval…
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Information Retrieval and Search Behavior
