PyVision-RL: Forging Open Agentic Vision Models via RL

Shitian Zhao; Shaoheng Lin; Ming Li; Haoquan Zhang; Wenshuo Peng; Kaipeng Zhang; Chen Wei

arXiv:2602.20739 · cs.AI · February 25, 2026

Open Access · 4 Models · 4 Datasets

TL;DR

PyVision-RL introduces a reinforcement learning framework that stabilizes training of multimodal agentic models, promoting sustained multi-turn interactions and efficient visual processing for improved performance.

Contribution

The paper presents PyVision-RL, a novel RL framework that prevents interaction collapse and enhances multimodal models with on-demand visual context sampling.
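To make "on-demand visual context sampling" concrete, here is a minimal sketch of a frame-selection helper that a model could call during reasoning instead of receiving every frame up front. This is an illustration only: the function name `sample_frame_indices`, its parameters, and the even-spacing rule are assumptions and are not taken from the paper.

```python
def sample_frame_indices(total_frames, fps, start_s, end_s, num_frames):
    """Return up to `num_frames` evenly spaced frame indices in [start_s, end_s].

    Hypothetical helper: the paper does not specify this interface.
    """
    if num_frames <= 0 or total_frames <= 0:
        return []
    # Convert the requested time window to frame indices, clamped to the video.
    first = max(0, int(start_s * fps))
    last = min(total_frames - 1, int(end_s * fps))
    if last < first:
        return []
    span = last - first
    if num_frames == 1 or span == 0:
        # A single frame from the middle of the window.
        return [first + span // 2]
    step = span / (num_frames - 1)
    # Rounding can produce duplicates on short windows, so deduplicate.
    return sorted({first + round(i * step) for i in range(num_frames)})
```

Under these assumptions, a request for 3 frames between seconds 2 and 4 of a 30 fps clip touches only 3 frames rather than all 60 in that window, which is the kind of saving in visual tokens the contribution statement refers to.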

Findings

01 Enhanced multi-turn reasoning capabilities

02 Reduced visual token usage during inference

03 Improved training stability and interaction sustainability

Abstract

Reinforcement learning for agentic multimodal models often suffers from interaction collapse, where models learn to reduce tool usage and multi-turn reasoning, limiting the benefits of agentic behavior. We introduce PyVision-RL, a reinforcement learning framework for open-weight multimodal models that stabilizes training and sustains interaction. Our approach combines an oversampling-filtering-ranking rollout strategy with an accumulative tool reward to prevent collapse and encourage multi-turn tool use. Using a unified training pipeline, we develop PyVision-Image and PyVision-Video for image and video understanding. For video reasoning, PyVision-Video employs on-demand context construction, selectively sampling task-relevant frames during reasoning to significantly reduce visual token usage. Experiments show strong performance and improved efficiency, demonstrating that sustained…
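The abstract names two mechanisms, an oversampling-filtering-ranking rollout strategy and an accumulative tool reward, without giving their exact form here. The sketch below shows one plausible reading, under clearly stated assumptions: the reward adds a per-call bonus only when the final answer is correct, groups of rollouts with no reward spread are filtered out, and the remaining groups are ranked by reward spread. The names `Rollout`, `accumulative_tool_reward`, `select_groups`, and the bonus value are all hypothetical.

```python
from dataclasses import dataclass
from statistics import pstdev


@dataclass
class Rollout:
    answer_correct: bool   # final answer matched the reference
    tool_calls: int        # number of tool-call turns in the trajectory
    broken: bool = False   # e.g. truncated or crashed trajectory


def accumulative_tool_reward(rollout, per_call_bonus=0.1):
    """Assumed reward: correctness plus a bonus that grows with each tool call."""
    if not rollout.answer_correct:
        return 0.0
    return 1.0 + per_call_bonus * rollout.tool_calls


def select_groups(groups, keep):
    """Oversample, filter, rank: keep the `keep` groups with the widest reward spread."""
    scored = []
    for group in groups:
        # Filter: drop unusable rollouts, then groups that carry no learning signal.
        usable = [r for r in group if not r.broken]
        if len(usable) < 2:
            continue
        spread = pstdev([accumulative_tool_reward(r) for r in usable])
        if spread == 0.0:
            continue
        scored.append((spread, usable))
    # Rank: groups with the most varied rewards come first.
    scored.sort(key=lambda item: item[0], reverse=True)
    return [usable for _, usable in scored[:keep]]
```

In this reading, generating more groups than the batch needs (oversampling) gives the filter and rank steps room to discard uninformative groups, while the per-call bonus means a correct multi-turn trajectory outscores a correct single-turn one, which pushes against the collapse toward fewer tool calls.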

Taxonomy

Topics: Multimodal Machine Learning Applications · Reinforcement Learning in Robotics · Social Robot Interaction and HRI