Visual Reasoning through Tool-supervised Reinforcement Learning

Qihua Dong; Gozde Sahin; Pei Wang; Zhaowei Cai; Robik Shrestha; Hao Yang; and Davide Modolo

arXiv:2604.19945·cs.CV·April 23, 2026


TL;DR

This paper introduces ToolsRL, a reinforcement learning framework that enables multimodal models to master visual tools for complex reasoning, improving tool-use capabilities through curriculum training.

Contribution

The paper proposes a novel tool-supervised reinforcement learning approach with a curriculum strategy to enhance visual reasoning in multimodal models.

Findings

01  ToolsRL achieves strong tool-use capabilities in visual reasoning tasks.

02  Curriculum training improves efficiency and effectiveness of tool mastery.

03  The framework effectively trains models to call and utilize visual tools.

Abstract

In this paper, we investigate how Multimodal Large Language Models can effectively master tool use to solve complex visual reasoning tasks. To this end, we propose a novel Tool-supervised Reinforcement Learning (ToolsRL) framework, which uses direct tool supervision for more effective tool-use learning. We focus on a series of simple, native, and interpretable visual tools, including zoom-in, rotate, flip, and draw point/line, for which tool supervision is easy to collect. We develop a reinforcement learning curriculum in which the first stage is optimized solely with a set of well-motivated tool-specific rewards, and the second stage is trained with accuracy-targeted rewards while tool calls are allowed. In this way, tool-calling capability is mastered before tools are used to complete visual reasoning tasks, avoiding the potential optimization conflict among those…
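
The two-stage curriculum described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, not the paper's actual reward design: the `Rollout` fields, the use of box IoU as the stage-1 reward for the zoom-in tool, and exact-match accuracy as the stage-2 reward are all hypothetical stand-ins for the "tool-specific" and "accuracy-targeted" rewards the abstract mentions.

```python
from dataclasses import dataclass
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


@dataclass
class Rollout:
    """One sampled trajectory (hypothetical structure, not from the paper)."""
    zoom_box: Box     # region the model asked the zoom-in tool to crop
    target_box: Box   # supervised zoom region (assumed annotation)
    answer: str       # model's final answer
    gold_answer: str  # reference answer


def curriculum_reward(rollout: Rollout, stage: int) -> float:
    """Return the scalar reward for a rollout at the given curriculum stage."""
    if stage == 1:
        # Stage 1: only the tool-specific reward is optimized.
        # Assumed form: overlap between predicted and supervised zoom boxes.
        return box_iou(rollout.zoom_box, rollout.target_box)
    if stage == 2:
        # Stage 2: accuracy-targeted reward; tool calls remain allowed
        # but are no longer rewarded directly.
        match = rollout.answer.strip().lower() == rollout.gold_answer.strip().lower()
        return 1.0 if match else 0.0
    raise ValueError(f"unknown curriculum stage: {stage}")


example = Rollout(
    zoom_box=(0.0, 0.0, 2.0, 2.0),
    target_box=(1.0, 1.0, 3.0, 3.0),
    answer=" Cat",
    gold_answer="cat",
)
stage1 = curriculum_reward(example, stage=1)
stage2 = curriculum_reward(example, stage=2)
```

Keeping the two reward signals in separate stages, rather than summing them, mirrors the abstract's stated motivation: the tool-calling skill is learned first, so the two objectives do not compete during optimization.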
