AdaTooler-V: Adaptive Tool-Use for Images and Videos

Chaoyang Wang; Kaituo Feng; Dongyang Chen; Zhongyu Wang; Zhixun Li; Sicheng Gao; Meng Meng; Xu Zhou; Manyuan Zhang; Yuzhang Shang; Xiangyu Yue

arXiv:2512.16918 · cs.CV · April 29, 2026

TL;DR

AdaTooler-V introduces an adaptive visual tool-use approach for multimodal large language models, reducing unnecessary tool invocation and improving reasoning performance across various visual tasks.

Contribution

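The paper proposes AT-GRPO, a reinforcement learning algorithm that trains MLLMs to invoke vision tools selectively rather than by default. It also contributes two training datasets, AdaTooler-V-CoT-100k (SFT cold start) and AdaTooler-V-300k (RL), and reports state-of-the-art results on multiple benchmarks.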

Findings

01. Outperforms existing methods on twelve visual reasoning benchmarks.

02. Achieves 89.8% accuracy on the V* high-resolution benchmark, surpassing GPT-4o and Gemini 1.5 Pro.

03. Effectively reduces unnecessary tool invocation, improving inference efficiency.

Abstract

Recent advances have shown that multimodal large language models (MLLMs) benefit from multimodal interleaved chain-of-thought (CoT) with vision tool interactions. However, existing open-source models often exhibit blind tool-use reasoning patterns, invoking vision tools even when they are unnecessary, which significantly increases inference overhead and degrades model performance. To this end, we propose AdaTooler-V, an MLLM that performs adaptive tool-use by determining whether a visual problem truly requires tools. First, we introduce AT-GRPO, a reinforcement learning algorithm that adaptively adjusts reward scales based on the Tool Benefit Score of each sample, encouraging the model to invoke tools only when they provide genuine improvements. Moreover, we construct two datasets to support training: AdaTooler-V-CoT-100k for SFT cold start and AdaTooler-V-300k for RL with verifiable…
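The abstract describes AT-GRPO only at a high level, so the following is a minimal sketch of the general idea, not the authors' formulation. It assumes a hypothetical Tool Benefit Score defined as the accuracy gap between tool-assisted and tool-free answers on a sample, and a hypothetical reward that adds a tool-usage term scaled by that score. The names `tool_benefit_score`, `shaped_reward`, `alpha`, and `max_calls` are illustrative assumptions; only the group-relative normalization step follows standard GRPO.

```python
from statistics import mean, pstdev


def tool_benefit_score(acc_with_tools: float, acc_without_tools: float) -> float:
    """Hypothetical per-sample score (an assumption, not the paper's definition).

    Positive when tools help on this sample, negative when they hurt.
    """
    return acc_with_tools - acc_without_tools


def shaped_reward(correct: bool, num_tool_calls: int, benefit: float,
                  alpha: float = 0.5, max_calls: int = 4) -> float:
    """Hypothetical reward: correctness plus a tool-usage term scaled by benefit.

    With a negative benefit, each tool call lowers the reward; with a
    positive benefit, tool calls raise it. alpha and max_calls are
    illustrative hyperparameters.
    """
    base = 1.0 if correct else 0.0
    usage = min(num_tool_calls, max_calls) / max_calls  # normalized to [0, 1]
    return base + alpha * benefit * usage


def group_relative_advantages(rewards, eps: float = 1e-6):
    """Standard GRPO-style normalization within one group of rollouts."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Two correct rollouts for the same question: one tool-free, one with 4 tool calls.
hurts = tool_benefit_score(acc_with_tools=0.5, acc_without_tools=0.9)
helps = tool_benefit_score(acc_with_tools=0.9, acc_without_tools=0.5)

adv_when_tools_hurt = group_relative_advantages(
    [shaped_reward(True, 0, hurts), shaped_reward(True, 4, hurts)]
)
adv_when_tools_help = group_relative_advantages(
    [shaped_reward(True, 0, helps), shaped_reward(True, 4, helps)]
)
```

Under these assumptions, the tool-free rollout receives the larger advantage when the sample's benefit score is negative, and the tool-using rollout receives the larger advantage when it is positive, which is the selective behavior the abstract attributes to AT-GRPO.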

