VTC-Bench: Evaluating Agentic Multimodal Models via Compositional Visual Tool Chaining
Xuanyu Zhu, Yuhao Dong, Rundong Wang, Yang Shi, Zhipeng Wu, Yinlun Peng, YiFan Zhang, Yihang Lou, Yuanxing Zhang, Ziwei Liu, Yan Bai, Yuan Zhou

TL;DR
VTC-Bench is a comprehensive benchmark designed to evaluate multimodal models' ability to compose and execute diverse visual tools in complex, multi-step tasks, revealing current limitations and guiding future improvements.
Contribution
The paper introduces VTC-Bench, a new benchmark with 32 visual operations and 680 problems to assess tool-use proficiency and multi-tool composition in multimodal large language models.
Findings
Current models struggle with diverse tool-sets and unseen operations.
Models have difficulty formulating efficient multi-step execution plans.
The leading model achieves only 51% on the benchmark.
Abstract
Recent advancements extend Multimodal Large Language Models (MLLMs) beyond standard visual question answering to utilizing external tools for advanced visual tasks. Despite this progress, precisely executing and effectively composing diverse tools for complex tasks remain persistent bottlenecks. Constrained by sparse tool-sets and simple tool-use trajectories, existing benchmarks fail to capture complex and diverse tool interactions, falling short in evaluating model performance under practical, real-world conditions. To bridge this gap, we introduce VisualToolChain-Bench (VTC-Bench), a comprehensive benchmark designed to evaluate tool-use proficiency in MLLMs. To align with realistic computer vision pipelines, our framework features 32 diverse OpenCV-based visual operations. This rich tool-set enables extensive combinations, allowing VTC-Bench to rigorously assess multi-tool composition…
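To make the idea of multi-tool composition concrete, here is a minimal sketch of a tool chain executor. It is an illustration only, not the benchmark's implementation: the tool names (`crop`, `threshold`, `invert`), the `TOOLS` registry, and the `run_chain` helper are all hypothetical, and the tools are written in plain Python on nested lists rather than with OpenCV so that the example stays self-contained.

```python
from typing import Callable, Dict, List, Tuple

# A grayscale image as rows of 0-255 integers (a stand-in for an OpenCV array).
Image = List[List[int]]


def crop(img: Image, top: int, left: int, height: int, width: int) -> Image:
    """Return the sub-image starting at (top, left) with the given size."""
    return [row[left:left + width] for row in img[top:top + height]]


def threshold(img: Image, t: int = 128) -> Image:
    """Binarize: pixels >= t become 255, all others become 0."""
    return [[255 if p >= t else 0 for p in row] for row in img]


def invert(img: Image) -> Image:
    """Flip intensities so that 0 becomes 255 and vice versa."""
    return [[255 - p for p in row] for row in img]


# Hypothetical tool registry: a model would pick tools from it by name.
TOOLS: Dict[str, Callable[..., Image]] = {
    "crop": crop,
    "threshold": threshold,
    "invert": invert,
}


def run_chain(img: Image, plan: List[Tuple[str, dict]]) -> Image:
    """Apply each (tool_name, kwargs) step in order, feeding each output forward."""
    for name, kwargs in plan:
        if name not in TOOLS:
            raise KeyError(f"unknown tool: {name}")
        img = TOOLS[name](img, **kwargs)
    return img


image = [
    [10, 200, 30],
    [250, 40, 220],
    [90, 130, 60],
]
plan = [
    ("crop", {"top": 0, "left": 1, "height": 2, "width": 2}),
    ("threshold", {"t": 128}),
    ("invert", {}),
]
result = run_chain(image, plan)
print(result)
```

In this sketch the plan is an ordered list of steps, so the order of tools and the arguments passed to each one both affect the result. That is the kind of multi-step planning the abstract describes, though the benchmark's actual tool interface may differ.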
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
