MMAT-1M: A Large Reasoning Dataset for Multimodal Agent Tuning

Tianhong Gao; Yannian Fu; Weiqun Wu; Haixiao Yue; Shanshan Liu; Gang Zhang

arXiv:2507.21924·cs.CV·July 30, 2025

TL;DR

The paper introduces MMAT-1M, a large-scale multimodal agent tuning dataset that enhances multimodal reasoning and tool use in language models through a novel four-stage data generation process.

Contribution

It presents the first million-scale multimodal agent tuning dataset supporting CoT, reflection, and dynamic tool usage, constructed via a novel multi-stage data engine.

Findings

01 Significant performance improvements on public benchmarks.

02 Enhanced multimodal reasoning and tool utilization.

03 Open-source dataset availability.

Abstract

Large Language Models (LLMs), enhanced through agent tuning, have demonstrated remarkable capabilities in Chain-of-Thought (CoT) and tool utilization, significantly surpassing the performance of standalone models. However, the multimodal domain still lacks a large-scale, high-quality agent tuning dataset to unlock the full potential of multimodal large language models. To bridge this gap, we introduce MMAT-1M, the first million-scale multimodal agent tuning dataset designed to support CoT, reflection, and dynamic tool usage. Our dataset is constructed through a novel four-stage data engine: 1) We first curate publicly available multimodal datasets containing question-answer pairs; 2) Then, leveraging GPT-4o, we generate rationales for the original question-answer pairs and dynamically integrate API calls and Retrieval Augmented Generation (RAG) information through a multi-turn paradigm;…
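The multi-turn paradigm described in stage 2 can be illustrated with a minimal sketch. Everything named here is hypothetical and not taken from the paper or the released dataset: the `Turn` and `AgentSample` types, the `build_sample` helper, the `ocr` tool, and the hand-written `plan` that stands in for the rationale a model such as GPT-4o would generate. It only shows how rationale steps and tool results might be interleaved into one training sample.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Turn:
    role: str      # "assistant" for rationale steps, "tool" for tool results
    content: str


@dataclass
class AgentSample:
    question: str
    answer: str
    turns: List[Turn] = field(default_factory=list)


# One plan step: (rationale text, optional tool name, optional tool argument).
PlanStep = Tuple[str, Optional[str], Optional[str]]


def build_sample(
    question: str,
    answer: str,
    plan: List[PlanStep],
    tools: Dict[str, Callable[[str], str]],
) -> AgentSample:
    """Interleave rationale steps with tool results, then append the answer."""
    sample = AgentSample(question, answer)
    for thought, tool_name, tool_arg in plan:
        sample.turns.append(Turn("assistant", thought))
        if tool_name is not None:
            # Hypothetical tool call; a real pipeline would call an API or retriever.
            result = tools[tool_name](tool_arg or "")
            sample.turns.append(Turn("tool", f"{tool_name}: {result}"))
    sample.turns.append(Turn("assistant", f"Final answer: {answer}"))
    return sample


# Stub tool and hand-written plan, used only for illustration.
tools = {"ocr": lambda image_id: "STOP"}
plan = [
    ("I need to read the text on the sign.", "ocr", "image_0"),
    ("The recognized text answers the question.", None, None),
]
sample = build_sample("What does the sign say?", "STOP", plan, tools)
for turn in sample.turns:
    print(f"[{turn.role}] {turn.content}")
```

Keeping tool results as separate turns, rather than folding them into the rationale text, is one plausible way to let a model learn when to call a tool and how to use its output. The actual record format of MMAT-1M may differ.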

Code & Models

Datasets

VIS-MPU-Agent/MMAT-1M
dataset· 95 dl
