VTool-R1: VLMs Learn to Think with Images via Reinforcement Learning on Multimodal Tool Use
Mingyuan Wu, Jingcheng Yang, Jize Jiang, Meitang Li, Kaizhuo Yan, Hanchao Yu, Minjia Zhang, Chengxiang Zhai, and Klara Nahrstedt

TL;DR
VTool-R1 trains vision-language models to generate multimodal reasoning chains by interleaving text and visual steps, improving reasoning accuracy through reinforcement learning with visual tools.
Contribution
It introduces the first framework for training VLMs to produce multimodal chains of thought using reinforcement learning with visual tools, without process supervision.
Findings
Reasoning performance improves on visual question answering tasks.
VLMs learn to strategically generate visual reasoning steps.
The code is open-sourced to support future research in multimodal reasoning.
Abstract
Reinforcement Learning Finetuning (RFT) has significantly advanced the reasoning capabilities of large language models (LLMs) by enabling long chains of thought, self-correction, and effective tool use. While recent works attempt to extend RFT to vision-language models (VLMs), these efforts largely produce text-only reasoning conditioned on static image inputs, falling short of true multimodal reasoning in the response. In contrast, test-time methods like Visual Sketchpad incorporate visual steps but lack training mechanisms. We introduce VTool-R1, the first framework that trains VLMs to generate multimodal chains of thought by interleaving text and intermediate visual reasoning steps. VTool-R1 integrates Python-based visual editing tools into the RFT process, enabling VLMs to learn when and how to generate visual reasoning steps that benefit final reasoning. Trained with…
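The tool-use loop the abstract describes can be illustrated with a minimal sketch. This is not the authors' implementation: the toy image type, the `highlight_rows` tool, the `run_tool_code` helper, the `policy` callable, and the prompt strings are all hypothetical names chosen for illustration, and the single tool round is an assumption. The sketch only shows the general shape: the model writes Python that edits the image, the code is executed, and the model then answers using the edited image, falling back to the original image if the code fails.

```python
from typing import Callable, List, Tuple

# Toy stand-in for an image: rows of grayscale pixel intensities (assumption).
Image = List[List[int]]


def highlight_rows(image: Image, rows: List[int], boost: int = 100) -> Image:
    """Hypothetical visual editing tool: brighten the selected rows."""
    return [
        [min(255, px + boost) for px in row] if i in rows else list(row)
        for i, row in enumerate(image)
    ]


def run_tool_code(code: str, image: Image) -> Tuple[Image, bool]:
    """Execute model-written Python against the image.

    Returns (edited_image, True) on success, or (original_image, False)
    if the code raises or never assigns `result`.
    """
    namespace = {"image": image, "highlight_rows": highlight_rows}
    try:
        exec(code, namespace)  # a real system would sandbox this call
        return namespace["result"], True
    except Exception:
        return image, False


def rollout(policy: Callable[[str, Image], str], question: str, image: Image) -> str:
    """Two-step rollout: the policy writes tool code, then answers."""
    tool_code = policy("Write Python that sets `result`: " + question, image)
    edited, ok = run_tool_code(tool_code, image)
    # Fall back to the unedited image when the tool call fails.
    return policy("Answer: " + question, edited if ok else image)
```

In a training setup, `policy` would be the VLM being finetuned, and the final answer returned by `rollout` is what a reward function would score.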
Peer Reviews
Decision: ICLR 2026 Poster
1. This work extends textual reasoning in multimodal understanding to multimodal reasoning that involves both images and text.
2. This work successfully enables VLMs to learn to integrate intermediate visual reasoning steps into text-based chains of thought in the generated response.
1. It is not recommended to use expressions like "the first RFT framework that trains VLMs to generate multimodal chains of thought". I think the expression "first" is questionable.
2. As shown in Figure 1, can only two-step reasoning be performed during model training? What about during inference after model training? Why wasn't it designed as a reasoning process with a maximum number of rounds? This is more in line with the phenomenon of multiple iterations in multimodal reasoning.
3. The expe
1. VTool-R1 is the first research work to demonstrate that RL can train VLMs to interleave visual steps within a chain of thought using Python-based visual editing tools.
2. Strong, controlled experiments and comparisons with state-of-the-art baselines, including commercial and open-source models, show clear improvement and robust methodology.
3. The framework and training design have potential to generalize to more diverse tools and reasoning tasks.
4. Explanations are accessible, with step-by-step
1. Single-Turn Tool Use: The model is only trained/tested with one round of tool invocation, limiting application in multi-step, interactive reasoning.
2. Limited Task Scope: The experiments focus only on structured chart and table reasoning with a simple, small, predefined set of visual editing tools.
3. Lack of Extensive Qualitative Comparison: The paper would benefit from further qualitative comparison.
This paper presents a clear and technically solid framework that adapts reinforcement finetuning (RFT) to multimodal reasoning. I like that the authors focus on the gap between text-only reasoning and truly visual reasoning, and they address it with a well-defined approach: integrating Python-based visual tools into the RL loop. The method is implemented cleanly and described in enough detail to be reproducible. The experiments, though limited in scope, are consistent and show that the mode
From my perspective, the contribution feels somewhat incremental: the method mainly extends DeepSeek-R1-style RFT to VLMs without introducing new algorithmic ideas. The evaluation is narrow, focusing only on structured tasks like chart and table reasoning, which limits how convincing the results are for general multimodal reasoning. The reward design depends on an LLM-based judge, which is subjective, and the tool-use evaluation metric (simply checking if Python runs) doesn't truly measure reason
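The two quantities this review questions can be made concrete with a small sketch. It is an illustration under stated assumptions, not the paper's code: `outcome_reward`, `exact_match_judge`, and `tool_success_rate` are hypothetical names, and the exact-match judge is a deliberately simple stand-in for the LLM-based judge. The reward looks only at the final answer (no process supervision), and the tool metric only checks whether generated Python executes, which is why it says nothing about whether the edit helped the reasoning.

```python
from typing import Callable, List


def outcome_reward(answer: str, reference: str,
                   judge: Callable[[str, str], bool]) -> float:
    """Outcome-only reward: 1.0 if the judge accepts the final answer, else 0.0."""
    return 1.0 if judge(answer, reference) else 0.0


def exact_match_judge(answer: str, reference: str) -> bool:
    """Stand-in for an LLM-based judge (assumption): case-insensitive exact match."""
    return answer.strip().lower() == reference.strip().lower()


def tool_success_rate(snippets: List[str]) -> float:
    """Fraction of code snippets that execute without raising.

    This measures executability only, not whether the edit was useful.
    """
    if not snippets:
        return 0.0
    succeeded = 0
    for code in snippets:
        try:
            exec(code, {})  # a real system would sandbox this call
            succeeded += 1
        except Exception:
            pass
    return succeeded / len(snippets)
```

Swapping `exact_match_judge` for a model-backed judge changes only the `judge` argument, which makes the subjectivity concern easy to isolate: the reward is exactly as reliable as that one callable.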
Taxonomy
Topics: EEG and Brain-Computer Interfaces · Industrial Vision Systems and Defect Detection · Advanced Neural Network Applications
