MTA-Agent: An Open Recipe for Multimodal Deep Search Agents

Xiangyu Peng; Can Qin; An Yan; Xinyi Yang; Zeyuan Chen; Ran Xu; Chien-Sheng Wu

arXiv:2604.06376 · cs.CV · April 9, 2026


TL;DR

This paper introduces MTA-Agent, a multi-hop, tool-augmented agent that synthesizes verified vision-language question-answer data for training multimodal deep-search agents. It reports state-of-the-art results and provides an open recipe for future research.

Contribution

The paper presents a multi-hop, tool-augmented approach to synthesizing training data, builds a large-scale verified dataset from it (MTA-Vision-DeepSearch, 21K multi-hop examples), and demonstrates improved reasoning and search capabilities in multimodal agents.

Findings

01. Achieves 54.63% average accuracy across six benchmarks.

02. Training increases the average number of reasoning steps from 2.27 to 4.28.

03. Provides an open dataset and implementation for reproducibility.

Abstract

Multimodal large language models (MLLMs) have demonstrated strong capabilities in visual understanding, yet they remain limited in complex, multi-step reasoning that requires deep searching and integrating visual evidence with external knowledge. In this work, we address this challenge by constructing high-quality, verified multi-hop vision-language training data for multimodal deep-search agents. We propose a Multi-hop Tool-Augmented Agent for Evidence-based QA Synthesis (MTA-Agent), which automatically selects tools and their parameters to retrieve and validate evidence from both visual and textual sources and generates structured multi-hop question-answer trajectories. Starting from diverse VQA seed datasets, our pipeline produces a large-scale training dataset, MTA-Vision-DeepSearch, containing 21K high-quality multi-hop examples. The data is filtered through a multi-stage…
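
To make the abstract's notion of a "structured multi-hop question-answer trajectory" more concrete, here is a minimal Python sketch of how such a record might be represented and screened. Everything in it is an assumption made for illustration: the class names `Hop` and `Trajectory`, the field names, the tool name `image_search`, and the `keep` filter are hypothetical and are not taken from the paper. In particular, the abstract is cut off before it describes the multi-stage filtering, so `keep` is only a placeholder for the general idea of retaining verified multi-hop examples, not the authors' procedure.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Hop:
    """One step of a trajectory: a tool call and the evidence it returned."""
    tool: str        # hypothetical tool name, e.g. "image_search"
    params: dict     # parameters chosen for the tool call
    evidence: str    # evidence retrieved by the tool
    verified: bool   # whether a validation step accepted the evidence


@dataclass
class Trajectory:
    """A multi-hop question-answer example derived from a VQA seed."""
    question: str
    answer: str
    hops: list[Hop] = field(default_factory=list)


def keep(trajectory: Trajectory, min_hops: int = 2) -> bool:
    """Placeholder filter (an assumption, not the paper's): keep examples
    that have at least `min_hops` hops, all of them verified."""
    enough_hops = len(trajectory.hops) >= min_hops
    all_verified = all(hop.verified for hop in trajectory.hops)
    return enough_hops and all_verified


# Usage with made-up data: one fully verified two-hop example, one that fails.
good = Trajectory(
    question="Which city hosts the museum that holds the painting shown?",
    answer="(placeholder answer)",
    hops=[
        Hop("image_search", {"top_k": 3}, "painting identified", True),
        Hop("web_search", {"query": "museum holding the painting"}, "museum found", True),
    ],
)
bad = Trajectory(
    question="(placeholder question)",
    answer="(placeholder answer)",
    hops=[Hop("web_search", {"query": "unverified claim"}, "weak evidence", False)],
)
dataset = [t for t in (good, bad) if keep(t)]
```

Keeping the tool name and parameters on each hop, rather than only the final answer, is what would let a later stage re-run or audit the evidence behind an example.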
