Multi-modal Agent Tuning: Building a VLM-Driven Agent for Efficient Tool Usage

Zhi Gao; Bofei Zhang; Pengxiang Li; Xiaojian Ma; Tao Yuan; Yue Fan; Yuwei Wu; Yunde Jia; Song-Chun Zhu; Qing Li

arXiv:2412.15606·cs.AI·February 4, 2025

Open Access · 1 Model · 1 Dataset

TL;DR

This paper introduces a multi-modal agent tuning approach that automatically generates training data for vision-language models, significantly improving their ability to reason about and utilize external tools in practical tasks.

Contribution

The paper presents a novel data synthesis pipeline and a tuning method for VLMs, enhancing multi-modal agent tool-usage reasoning with 20K synthesized task trajectories.

Findings

01

T3-Agent outperforms untrained VLMs by 20% on the GTA and GAIA benchmarks.

02

The data synthesis pipeline improves tool-usage capabilities.

03

Enhanced VLMs show better reasoning in practical tasks.

Abstract

The advancement of large language models (LLMs) prompts the development of multi-modal agents, which are used as a controller to call external tools, providing a feasible way to solve practical tasks. In this paper, we propose a multi-modal agent tuning method that automatically generates multi-modal tool-usage data and tunes a vision-language model (VLM) as the controller for powerful tool-usage reasoning. To preserve the data quality, we prompt the GPT-4o mini model to generate queries, files, and trajectories, followed by query-file and trajectory verifiers. Based on the data synthesis pipeline, we collect the MM-Traj dataset that contains 20K tasks with trajectories of tool usage. Then, we develop the T3-Agent via Trajectory Tuning on VLMs for Tool usage using MM-Traj. Evaluations on the GTA and GAIA benchmarks show that the T3-Agent consistently…
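The generate-then-verify flow described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Task` record, the `synthesize` function, its callback parameters, and the ordering of the two verification steps are hypothetical and are not taken from the paper's code. In the paper the generators and verifiers are model-driven; the sketch replaces them with trivial stand-in callables.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Task:
    """One synthesized sample (hypothetical structure, for illustration only)."""
    query: str
    files: List[str]
    trajectory: List[str]


def synthesize(
    generate_query: Callable[[], str],
    generate_files: Callable[[str], List[str]],
    generate_trajectory: Callable[[str, List[str]], List[str]],
    verify_query_files: Callable[[str, List[str]], bool],
    verify_trajectory: Callable[[Task], bool],
    num_candidates: int,
) -> List[Task]:
    """Generate candidate tasks and keep only those that pass both verifiers.

    Assumption: the query-file check runs before trajectory generation, so
    that no trajectory is produced for a rejected query-file pair.
    """
    kept: List[Task] = []
    for _ in range(num_candidates):
        query = generate_query()
        files = generate_files(query)
        if not verify_query_files(query, files):
            continue  # drop the candidate early
        task = Task(query, files, generate_trajectory(query, files))
        if verify_trajectory(task):
            kept.append(task)
    return kept


# Toy stand-ins for the model-driven generators and verifiers.
queries = iter(["count the cars in the image", "", "summarize the chart"])

kept = synthesize(
    generate_query=lambda: next(queries),
    generate_files=lambda q: ["input.jpg"] if q else [],
    generate_trajectory=lambda q, f: (
        ["thought: inspect the image", "code: <tool call>"] if "count" in q else []
    ),
    verify_query_files=lambda q, f: bool(q) and bool(f),
    verify_trajectory=lambda t: len(t.trajectory) > 0,
    num_candidates=3,
)
print(len(kept), kept[0].query)
```

In this toy run the empty query is rejected by the query-file check and the third candidate is rejected for having an empty trajectory, so only the first candidate is kept. Passing the generators and verifiers as callables keeps the filtering logic independent of whichever model produces the data.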

Code & Models

Models

PengxiangLi/MAT
model · 6 dl · ♡ 2

Datasets

PengxiangLi/MAT
dataset · 49 dl

Taxonomy

Topics: Robotic Path Planning Algorithms · Multi-Agent Systems and Negotiation · Semantic Web and Ontologies