VTC-Bench: Evaluating Agentic Multimodal Models via Compositional Visual Tool Chaining

Xuanyu Zhu; Yuhao Dong; Rundong Wang; Yang Shi; Zhipeng Wu; Yinlun Peng; YiFan Zhang; Yihang Lou; Yuanxing Zhang; Ziwei Liu; Yan Bai; Yuan Zhou

arXiv:2603.15030·cs.AI·March 20, 2026

TL;DR

VTC-Bench is a comprehensive benchmark designed to evaluate multimodal models' ability to compose and execute diverse visual tools in complex, multi-step tasks, revealing current limitations and guiding future improvements.

Contribution

The paper introduces VTC-Bench, a new benchmark with 32 visual operations and 680 problems to assess tool-use proficiency and multi-tool composition in multimodal large language models.

Findings

01. Current models struggle with diverse tool-sets and unseen operations.

02. Models have difficulty formulating efficient multi-step execution plans.

03. The leading model achieves only 51% on the benchmark.

Abstract

Recent advancements extend Multimodal Large Language Models (MLLMs) beyond standard visual question answering to using external tools for advanced visual tasks. Despite this progress, precisely executing and effectively composing diverse tools for complex tasks remain a persistent bottleneck. Constrained by sparse tool-sets and simple tool-use trajectories, existing benchmarks fail to capture complex and diverse tool interactions and fall short in evaluating model performance under practical, real-world conditions. To bridge this gap, we introduce VisualToolChain-Bench (VTC-Bench), a comprehensive benchmark designed to evaluate tool-use proficiency in MLLMs. To align with realistic computer vision pipelines, our framework features 32 diverse OpenCV-based visual operations. This rich tool-set enables extensive combinations, allowing VTC-Bench to rigorously assess multi-tool composition…
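
To make the idea of multi-tool composition concrete, the following is a minimal sketch of what chaining visual tools could look like. It is not the benchmark's actual interface: the tool names (`crop`, `threshold`, `invert`), the `TOOLS` registry, and the `run_chain` helper are all hypothetical, and plain Python lists stand in for OpenCV images so the example has no dependencies.

```python
from typing import Callable, Dict, List, Tuple

# A grayscale image as rows of 0-255 intensities (stand-in for an OpenCV array).
Image = List[List[int]]

# Hypothetical tool registry. These names and signatures are illustrative
# assumptions, not the 32 operations defined by VTC-Bench.
TOOLS: Dict[str, Callable[..., Image]] = {}


def tool(name: str):
    """Register a function under a tool name so a plan can refer to it."""
    def register(fn):
        TOOLS[name] = fn
        return fn
    return register


@tool("crop")
def crop(img: Image, top: int, left: int, height: int, width: int) -> Image:
    return [row[left:left + width] for row in img[top:top + height]]


@tool("threshold")
def threshold(img: Image, cutoff: int) -> Image:
    # Pixels strictly above the cutoff become white, all others black.
    return [[255 if px > cutoff else 0 for px in row] for row in img]


@tool("invert")
def invert(img: Image) -> Image:
    return [[255 - px for px in row] for row in img]


def run_chain(img: Image, steps: List[Tuple[str, dict]]) -> Image:
    """Apply each (tool_name, kwargs) step in order, feeding each output
    into the next tool. Unknown tool names raise KeyError."""
    for name, kwargs in steps:
        if name not in TOOLS:
            raise KeyError(f"unknown tool: {name}")
        img = TOOLS[name](img, **kwargs)
    return img


image = [
    [10, 200, 30, 40],
    [50, 220, 70, 80],
    [90, 240, 110, 120],
]

# A three-step plan, of the kind a model might emit for a multi-step task.
plan = [
    ("crop", {"top": 0, "left": 1, "height": 2, "width": 2}),
    ("threshold", {"cutoff": 100}),
    ("invert", {}),
]

result = run_chain(image, plan)
print(result)
```

In this toy setup, the order of steps matters (cropping before thresholding touches fewer pixels than the reverse), which hints at why planning an efficient multi-step chain is harder than calling a single tool.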

Code & Models

Datasets

zzzhu/VTC-Bench
dataset · 1.0k dl

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling