Empowering Multimodal LLMs with External Tools: A Comprehensive Survey

Wenbin An; Jiahao Nie; Yaqiang Wu; Feng Tian; Shijian Lu; Qinghua Zheng

arXiv:2508.10955 · cs.CV · August 18, 2025

TL;DR

This survey examines how integrating external tools such as APIs, expert models, and knowledge bases can improve the data quality, task performance, and evaluation of Multimodal Large Language Models (MLLMs), and outlines directions for future work.

Contribution

The paper gives a comprehensive overview of how external tools can improve MLLMs in data acquisition, task performance, and evaluation, and discusses future research directions.

Findings

01. External tools can improve multimodal data quality and annotation.

02. Tool integration enhances MLLM performance on complex tasks.

03. The survey identifies current limitations and future opportunities for tool-augmented MLLMs.

Abstract

By integrating the perception capabilities of multimodal encoders with the generative power of Large Language Models (LLMs), Multimodal Large Language Models (MLLMs), exemplified by GPT-4V, have achieved great success in various multimodal tasks, pointing toward a promising pathway to artificial general intelligence. Despite this progress, the limited quality of multimodal data, poor performance on many complex downstream tasks, and inadequate evaluation protocols continue to hinder the reliability and broader applicability of MLLMs across diverse domains. Inspired by the human ability to leverage external tools for enhanced reasoning and problem-solving, augmenting MLLMs with external tools (e.g., APIs, expert models, and knowledge bases) offers a promising strategy to overcome these challenges. In this paper, we present a comprehensive survey on leveraging external tools to enhance…
