Scaling Agentic Reinforcement Learning for Tool-Integrated Reasoning in VLMs

Meng Lu; Ran Xu; Yi Fang; Wenxuan Zhang; Yue Yu; Gaurav Srivastava; Yuchen Zhuang; Mohamed Elhoseiny; Charles Fleming; Carl Yang; Zhengzhong Tu; Yang Xie; Guanghua Xiao; Hanrui Wang; Di Jin; Wenqi Shi; Xuan Wang

arXiv:2511.19773 · cs.AI · November 26, 2025

TL;DR

This paper introduces VISTA-Gym, a scalable training environment that uses visual tools and reinforcement learning to improve how vision-language models reason through multi-step visual interactions. The resulting VISTA-R1-8B outperforms state-of-the-art baselines by 9.51%-18.72%.

Contribution

The paper presents VISTA-Gym, a unified platform for training VLMs with tool-integrated reasoning using reinforcement learning, enabling models to better handle complex visual reasoning tasks.

Findings

1. VISTA-R1-8B outperforms state-of-the-art baselines by 9.51%-18.72%.
2. VISTA-Gym effectively trains models for multi-step visual reasoning.
3. Models trained with VISTA-Gym show improved tool use and reasoning capabilities.

Abstract

While recent vision-language models (VLMs) demonstrate strong image understanding, their ability to "think with images", i.e., to reason through multi-step visual interactions, remains limited. We introduce VISTA-Gym, a scalable training environment for incentivizing tool-integrated visual reasoning capabilities in VLMs. VISTA-Gym unifies diverse real-world multimodal reasoning tasks (7 tasks from 13 datasets in total) with a standardized interface for visual tools (e.g., grounding, parsing), executable interaction loops, verifiable feedback signals, and efficient trajectory logging, enabling visual agentic reinforcement learning at scale. While recent VLMs exhibit strong text-only reasoning, both proprietary and open-source models still struggle with tool selection, invocation, and coordination. With VISTA-Gym, we train VISTA-R1 to interleave tool-use with agentic reasoning via…
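The abstract's description of a standardized tool interface, an executable interaction loop, verifiable feedback, and trajectory logging can be illustrated with a small sketch. This is not the VISTA-Gym API: `ToyToolEnv`, `Step`, `scripted_policy`, the `parse_chart` tool, and the exact-match reward are all hypothetical names and assumptions chosen for illustration, and a scripted function stands in for the VLM.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Step:
    """One logged interaction: what the agent did and what came back."""
    action: str
    argument: str
    observation: str
    reward: float


@dataclass
class ToyToolEnv:
    """Hypothetical gym-style environment (an assumption, not VISTA-Gym's interface)."""
    question: str
    gold_answer: str
    tools: Dict[str, Callable[[str], str]]  # tool name -> callable on a string argument
    max_turns: int = 4
    trajectory: List[Step] = field(default_factory=list)

    def reset(self) -> str:
        """Clear the log and return the task prompt as the first observation."""
        self.trajectory = []
        return self.question

    def step(self, action: str, argument: str) -> Tuple[str, float, bool]:
        """Run a tool call or score a final answer; returns (observation, reward, done)."""
        if action == "answer":
            # Assumed verifiable feedback: exact match against the gold answer.
            correct = argument.strip().lower() == self.gold_answer.strip().lower()
            observation, reward, done = "episode finished", float(correct), True
        elif action in self.tools:
            observation, reward, done = self.tools[action](argument), 0.0, False
        else:
            observation, reward, done = f"unknown tool: {action}", 0.0, False
        self.trajectory.append(Step(action, argument, observation, reward))
        if len(self.trajectory) >= self.max_turns:
            done = True  # cap the episode length
        return observation, reward, done


def scripted_policy(observation: str, turn: int) -> Tuple[str, str]:
    """Stand-in for a VLM: call a tool first, then answer with the tool's output."""
    if turn == 0:
        return "parse_chart", "figure_1"
    return "answer", observation


def run_episode(env: ToyToolEnv) -> float:
    """Roll out one episode and return the summed reward."""
    observation = env.reset()
    total, done, turn = 0.0, False, 0
    while not done:
        action, argument = scripted_policy(observation, turn)
        observation, reward, done = env.step(action, argument)
        total += reward
        turn += 1
    return total


env = ToyToolEnv(
    question="What is the value of the tallest bar?",
    gold_answer="42",
    tools={"parse_chart": lambda image_id: "42"},  # fake tool output for the demo
)
episode_return = run_episode(env)
```

After the rollout, `env.trajectory` holds one `Step` per turn. A reinforcement learning trainer would consume logged trajectories of this kind together with their rewards, though how VISTA-Gym does so is not specified in the text above.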

Code & Models

Hugging Face model: LuKasatvt/VISTA-Gym

Taxonomy

Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Domain Adaptation and Few-Shot Learning