MM-DeepResearch: A Simple and Effective Multimodal Agentic Search Baseline
Huanjin Yao, Qixiang Yin, Min Yang, Ziwang Zhao, Yibo Wang, Haotian Luo, Jingyi Zhang, Jiaxing Huang

TL;DR
MM-DeepResearch introduces a multimodal research agent that leverages hypergraph-based QA generation, specialized search tool experts, and an offline search engine to enable deep, multi-tool research tasks without online API costs.
Contribution
The paper presents Hyper-Search, DR-TTS, and an offline search engine, forming a novel framework for multimodal research agents capable of explicit reasoning and tool invocation.
Findings
Outperforms existing methods on multimodal research tasks.
Effectively generates search-intensive multimodal QA pairs.
Successfully conducts complex research tasks with offline search tools.
Abstract
We aim to develop a multimodal research agent capable of explicit reasoning and planning, multi-tool invocation, and cross-modal information synthesis, enabling it to conduct deep research tasks. However, we observe three main challenges in developing such agents: (1) scarcity of search-intensive multimodal QA data, (2) lack of effective search trajectories, and (3) prohibitive cost of training with online search APIs. To tackle them, we first propose Hyper-Search, a hypergraph-based QA generation method that models and connects visual and textual nodes within and across modalities, enabling the generation of search-intensive multimodal QA pairs that require invoking various search tools to solve. Second, we introduce DR-TTS, which first decomposes search-involved tasks into several categories according to search tool types and then optimizes specialized search tool experts for each…
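As a rough illustration of the hypergraph idea described in the abstract, the sketch below stores visual and textual nodes and groups them with hyperedges that may span both modalities. It is not the authors' implementation: the class names, the `image_search` and `text_search` tool labels, and the rule mapping a modality to a tool are all assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    node_id: str
    modality: str  # "image" or "text" (assumed labels)
    content: str


@dataclass
class Hypergraph:
    nodes: dict = field(default_factory=dict)
    hyperedges: list = field(default_factory=list)

    def add_node(self, node: Node) -> None:
        self.nodes[node.node_id] = node

    def add_hyperedge(self, node_ids) -> None:
        # A hyperedge connects any number of existing nodes at once.
        missing = [n for n in node_ids if n not in self.nodes]
        if missing:
            raise KeyError(f"unknown node ids: {missing}")
        self.hyperedges.append(frozenset(node_ids))

    def modalities(self, edge) -> set:
        return {self.nodes[n].modality for n in edge}

    def cross_modal_edges(self) -> list:
        # Hyperedges that link visual and textual nodes together.
        return [e for e in self.hyperedges if len(self.modalities(e)) > 1]


# Hypothetical mapping from node modality to the search tool a question
# built on that node would need; not taken from the paper.
TOOL_FOR_MODALITY = {"image": "image_search", "text": "text_search"}


def required_tools(graph: Hypergraph, edge) -> list:
    return sorted(TOOL_FOR_MODALITY[m] for m in graph.modalities(edge))
```

Under these assumptions, a question generated from a cross-modal hyperedge would need more than one search tool to answer, which is one way to read the "search-intensive" property the abstract describes.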
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Information Retrieval and Search Behavior
