MLLM-Tool: A Multimodal Large Language Model For Tool Agent Learning

Chenyu Wang; Weixin Luo; Sixun Dong; Xiaohua Xuan; Zhengxin Li; Lin Ma; Shenghua Gao

arXiv:2401.10727·cs.CV·April 14, 2025·1 cites

Open Access · 2 Repos

TL;DR

MLLM-Tool gives large language models multimodal perception so that they can identify and recommend the appropriate tool from instructions grounded in visual or auditory input.

Contribution

The paper introduces MLLM-Tool, integrating multimodal encoders with open-source LLMs to improve tool selection from multi-modal instructions, and provides a new dataset for evaluation.

Findings

01. MLLM-Tool effectively recommends tools for multimodal instructions.

02. The dataset includes multiple solutions for the same instruction.

03. Experimental results demonstrate improved tool selection accuracy.

Abstract

The strong performance of large language models (LLMs) on natural language comprehension and generation tasks has prompted extensive work on using them as central controllers for agent systems. Several studies focus on connecting LLMs to external tools to broaden their application scenarios. However, current LLMs perceive tool-use requests only through a single text query, which can leave the user's real intention ambiguous. LLMs are expected to resolve that ambiguity by perceiving the information in visually or auditorily grounded instructions. In this paper, we therefore propose MLLM-Tool, a system that combines open-source LLMs with multi-modal encoders so that the trained LLMs are aware of multi-modal input instructions and can select the function-matched tool correctly. To facilitate the evaluation of the model's capability, we…

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech and dialogue systems
