MEDVISTAGYM: A Scalable Training Environment for Thinking with Medical Images via Tool-Integrated Reinforcement Learning

Meng Lu; Yuxing Lu; Yuchen Zhuang; Megan Mullins; Yang Xie; Guanghua Xiao; Charles Fleming; Wenqi Shi; Xuan Wang

arXiv:2601.07107 · cs.CV · January 13, 2026

Open Access

TL;DR

This paper introduces MedVistaGym, a training environment that enhances medical vision language models by enabling multi-step, tool-integrated reasoning for medical image analysis, significantly improving performance on medical VQA benchmarks.

Contribution

We present MedVistaGym, a scalable, interactive training environment that trains models to effectively select and use tools for multi-modal medical reasoning, a capability lacking in open-source VLMs.

Findings

01. MedVistaGym-R1-8B outperforms baselines by 19.10% to 24.21% on six medical VQA benchmarks.

02. Structured agentic training significantly improves tool-integrated reasoning.

03. The environment enables models to localize relevant image regions and coordinate tool use effectively.

Abstract

Vision language models (VLMs) achieve strong performance on general image understanding but struggle to think with medical images, especially when performing multi-step reasoning through iterative visual interaction. Medical VLMs often rely on static visual embeddings and single-pass inference, preventing models from re-examining, verifying, or refining visual evidence during reasoning. While tool-integrated reasoning offers a promising path forward, open-source VLMs lack the training infrastructure to learn effective tool selection, invocation, and coordination in multi-modal medical reasoning. We introduce MedVistaGym, a scalable and interactive training environment that incentivizes tool-integrated visual reasoning for medical image analysis. MedVistaGym equips VLMs to determine when and which tools to invoke, localize task-relevant image regions, and integrate single or multiple…
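The interaction pattern the abstract describes (decide when and which tool to invoke, localize a region, then fold the tool output back into the answer) can be illustrated with a toy episode loop. This is a minimal sketch under stated assumptions, not the MedVistaGym API: `ToyToolEnv`, `crop`, `mean_intensity`, `scripted_policy`, the action dictionaries, the 0/1 reward, and the intensity threshold are all hypothetical stand-ins, and a hand-written policy takes the place of a trained VLM.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Image = List[List[int]]  # toy grayscale image as nested lists (assumption)


def crop(image: Image, box: Tuple[int, int, int, int]) -> Image:
    """Return the sub-image for box = (row0, col0, row1, col1), end-exclusive."""
    r0, c0, r1, c1 = box
    return [row[c0:c1] for row in image[r0:r1]]


def mean_intensity(image: Image) -> float:
    """Average pixel value; a stand-in for a real image-analysis tool."""
    pixels = [p for row in image for p in row]
    return sum(pixels) / len(pixels)


@dataclass
class ToyToolEnv:
    """Hypothetical environment: one image, one gold answer, a step budget."""
    image: Image
    answer: str
    max_steps: int = 4
    steps: int = 0
    view: Image = field(default_factory=list, init=False)

    def __post_init__(self) -> None:
        self.view = self.image  # the agent starts with the full image

    def step(self, action: dict):
        """Apply one action and return (observation, reward, done)."""
        self.steps += 1
        if action["type"] == "answer":
            reward = 1.0 if action["value"] == self.answer else 0.0
            return None, reward, True
        if action["type"] == "crop":
            self.view = crop(self.view, action["box"])
            obs = self.view
        elif action["type"] == "mean_intensity":
            obs = mean_intensity(self.view)
        else:
            raise ValueError(f"unknown action type: {action['type']}")
        # Tool calls earn no reward; the episode ends when the budget runs out.
        return obs, 0.0, self.steps >= self.max_steps


def scripted_policy(history: list) -> dict:
    """Hand-written stand-in for a model: localize, measure, then answer."""
    if len(history) == 0:
        return {"type": "crop", "box": (0, 2, 2, 4)}
    if len(history) == 1:
        return {"type": "mean_intensity"}
    label = "abnormal" if history[-1] > 100 else "normal"  # arbitrary threshold
    return {"type": "answer", "value": label}


def run_episode(env: ToyToolEnv, policy: Callable[[list], dict]):
    """Roll out one episode; return the final reward and the number of steps."""
    history, reward, done = [], 0.0, False
    while not done:
        obs, reward, done = env.step(policy(history))
        history.append(obs)
    return reward, len(history)


# Toy 4x4 image with a bright patch in the top-right corner.
IMAGE = [
    [10, 10, 200, 220],
    [10, 10, 210, 230],
    [10, 10, 10, 10],
    [10, 10, 10, 10],
]

reward, num_steps = run_episode(ToyToolEnv(image=IMAGE, answer="abnormal"), scripted_policy)
print(reward, num_steps)
```

In a reinforcement learning setup, the scripted policy would be replaced by the model being trained, and the terminal reward would drive the policy update. The reward design here is a placeholder and is not taken from the paper.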

Taxonomy

Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Generative Adversarial Networks and Image Synthesis