What Does Vision Tool-Use Reinforcement Learning Really Learn? Disentangling Tool-Induced and Intrinsic Effects for Crop-and-Zoom

Yan Ma; Weiyu Zhang; Tianle Li; Linge Du; Xuyang Shen; Pengfei Liu

arXiv:2602.01334 · cs.CV · May 22, 2026

TL;DR

This paper introduces MED, a framework that disentangles intrinsic capability changes from tool-induced effects in vision tool-use RL, and finds that current training mainly reduces tool-induced harm rather than producing tool mastery.

Contribution

The paper presents MED (Measure-Explain-Diagnose), a coarse-to-fine analysis framework that separates intrinsic capability changes from tool-induced effects and decomposes the tool-induced performance difference into gain and harm terms.

Findings

01 Improvements are dominated by intrinsic learning, not tool mastery.

02 Tool-use RL mainly reduces tool-induced harm, such as call-induced errors and tool schema interference.

03 Current models coexist with tools but do not fully master them.

Abstract

Vision tool-use reinforcement learning (RL) can equip vision-language models with visual operators such as crop-and-zoom and achieves strong performance gains, yet it remains unclear whether these gains are driven by improvements in tool use or evolving intrinsic capabilities. We introduce MED (Measure-Explain-Diagnose), a coarse-to-fine framework that disentangles intrinsic capability changes from tool-induced effects, decomposes the tool-induced performance difference into gain and harm terms, and probes the mechanisms driving their evolution. Across checkpoint-level analyses in the crop-and-zoom setting on two VLMs with different tool priors and six benchmarks, we find that improvements are dominated by intrinsic learning, while tool-use RL mainly reduces tool-induced harm (e.g., fewer call-induced errors and weaker tool schema interference) and yields limited progress in…
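The abstract does not give the formulas behind the gain and harm terms, so the following is only a minimal sketch of one natural reading, assuming each benchmark sample is evaluated twice at the same checkpoint (once without the tool, once with it). Under that assumption the accuracy difference splits exactly into the fraction of samples the tool fixes minus the fraction it breaks. The function name `decompose_tool_effect` and the result keys are hypothetical and are not taken from the paper.

```python
from typing import Dict, Sequence


def decompose_tool_effect(
    correct_without_tool: Sequence[bool],
    correct_with_tool: Sequence[bool],
) -> Dict[str, float]:
    """Split the with-tool vs. without-tool accuracy difference into gain and harm.

    Assumption (not from the paper): both sequences hold per-sample
    correctness for the same samples, in the same order.
    """
    if len(correct_without_tool) != len(correct_with_tool):
        raise ValueError("both runs must cover the same samples")
    n = len(correct_without_tool)
    if n == 0:
        raise ValueError("at least one sample is required")

    pairs = list(zip(correct_without_tool, correct_with_tool))
    # Gain: wrong without the tool, correct with it.
    gain = sum(1 for without, with_ in pairs if not without and with_) / n
    # Harm: correct without the tool, wrong with it.
    harm = sum(1 for without, with_ in pairs if without and not with_) / n

    return {
        "acc_without_tool": sum(1 for without, _ in pairs if without) / n,
        "acc_with_tool": sum(1 for _, with_ in pairs if with_) / n,
        "gain": gain,
        "harm": harm,
        # Identity: acc_with_tool - acc_without_tool == gain - harm.
        "net_tool_effect": gain - harm,
    }
```

Tracking `gain` and `harm` separately across checkpoints is what makes a claim like "RL mainly reduces harm" checkable: a rising net effect could come from harm shrinking while gain stays flat, which the net number alone would hide.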

Code & Models

Datasets

Med2026/Med-eval-logs
dataset · 119 downloads

Taxonomy

Topics: Robot Manipulation and Learning · Multimodal Machine Learning Applications · Reinforcement Learning in Robotics