Beyond Text: Unveiling Multimodal Proficiency of Large Language Models with MultiAPI Benchmark
Xiao Liu, Jianfeng Lin, Jiawei Zhang

TL;DR
This paper introduces MultiAPI, a large-scale benchmark dataset for evaluating large language models' multimodal capabilities, revealing strengths in API decision-making but challenges in domain understanding and argument generation.
Contribution
The study presents MultiAPI, a novel comprehensive benchmark dataset designed to evaluate and analyze LLMs' proficiency in multimodal and tool-augmented tasks.
Findings
LLMs are proficient in API call decision-making.
Challenges remain in domain identification and argument generation.
Auxiliary context can impair LLM performance.
Abstract
The proliferation of Large Language Models like ChatGPT has significantly advanced language understanding and generation, impacting a broad spectrum of applications. However, these models predominantly excel in text-based tasks, overlooking the complexity of real-world multimodal information. This study introduces MultiAPI, a pioneering comprehensive large-scale API benchmark dataset aimed at expanding LLMs' proficiency in multimodal contexts. Developed collaboratively through ChatGPT, MultiAPI consists of 235 diverse API calls and 2,038 contextual prompts, offering a unique platform evaluation of tool-augmented LLMs handling multimodal tasks. Through comprehensive experiments, our findings reveal that while LLMs demonstrate proficiency in API call decision-making, they face challenges in domain identification, function selection, and argument generation. What's more, we surprisingly…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsTopic Modeling · Text Readability and Simplification · Natural Language Processing Techniques
