MLLM-Tool: A Multimodal Large Language Model For Tool Agent Learning
Chenyu Wang, Weixin Luo, Sixun Dong, Xiaohua Xuan, Zhengxin Li, Lin Ma, Shenghua Gao

TL;DR
MLLM-Tool enhances large language models with multimodal perception to accurately identify and recommend tools based on visual and auditory instructions, advancing agent systems' understanding and interaction capabilities.
Contribution
The paper introduces MLLM-Tool, integrating multimodal encoders with open-source LLMs to improve tool selection from multi-modal instructions, and provides a new dataset for evaluation.
Findings
MLLM-Tool effectively recommends tools for multimodal instructions.
The dataset includes multiple solutions for the same instruction.
Experimental results demonstrate improved tool selection accuracy.
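Because the dataset allows several valid tools for one instruction, one plausible way to score tool selection is to count a prediction as correct when it matches any of the acceptable tools. The sketch below illustrates that idea only. The function name, the data layout, and the tool names are hypothetical, and the paper's actual metric may be defined differently.

```python
from typing import Dict, Set


def tool_selection_accuracy(
    predictions: Dict[str, str],
    references: Dict[str, Set[str]],
) -> float:
    """Return the fraction of instructions whose predicted tool is acceptable.

    Assumption (not from the paper): each instruction id maps to a set of
    acceptable tool names, and a missing prediction counts as wrong.
    """
    if not references:
        raise ValueError("references must not be empty")
    correct = sum(
        1
        for instruction_id, acceptable in references.items()
        if predictions.get(instruction_id) in acceptable
    )
    return correct / len(references)


# Hypothetical example data; the tool names are placeholders.
references = {
    "q1": {"image-captioner", "vqa-model"},  # two valid solutions
    "q2": {"speech-recognizer"},
    "q3": {"object-detector"},
}
predictions = {
    "q1": "vqa-model",
    "q2": "speech-recognizer",
    "q3": "image-captioner",  # not acceptable for q3
}
accuracy = tool_selection_accuracy(predictions, references)
```

Here two of the three predictions fall inside their acceptable sets, so `accuracy` is 2/3. Scoring against a set rather than a single label avoids penalizing a model for choosing a different but equally valid tool.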
Abstract
Recently, the astonishing performance of large language models (LLMs) in natural language comprehension and generation tasks triggered lots of exploration of using them as central controllers to build agent systems. Multiple studies focus on bridging the LLMs to external tools to extend the application scenarios. However, the current LLMs' ability to perceive tool use is limited to a single text query, which may result in ambiguity in understanding the users' real intentions. LLMs are expected to eliminate that by perceiving the information in the visual- or auditory-grounded instructions. Therefore, in this paper, we propose MLLM-Tool, a system incorporating open-source LLMs and multi-modal encoders so that the learned LLMs can be conscious of multi-modal input instruction and then select the function-matched tool correctly. To facilitate the evaluation of the model's capability, we…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech and dialogue systems
