MedAgentGym: A Scalable Agentic Training Environment for Code-Centric Reasoning in Biomedical Data Science

Ran Xu; Yuchen Zhuang; Yishan Zhong; Yue Yu; Zifeng Wang; Xiangru Tang; Hang Wu; May D. Wang; Peifeng Ruan; Donghan Yang; Tao Wang; Guanghua Xiao; Xin Liu; Carl Yang; Yang Xie; Wenqi Shi

arXiv:2506.04405 · cs.CL · October 7, 2025

Open Access · 3 Reviews

TL;DR

MedAgentGym is a scalable, interactive training environment with a large set of biomedical tasks. It improves the code-centric reasoning of LLM agents, yields significant performance gains, and serves as a cost-effective training platform.

Contribution

Introduces MedAgentGym, a comprehensive, scalable platform with real-world biomedical tasks for training and benchmarking LLMs in biomedical reasoning.

Findings

01 MedAgentGym contains 72,413 task instances across 129 categories.

02 Med-Copilot gains +43.02% from offline and +45.28% from online reinforcement learning.

03 Benchmarking of 29 LLMs reveals substantial performance disparities between commercial and open-source models.

Abstract

We introduce MedAgentGym, a scalable and interactive training environment designed to enhance coding-based biomedical reasoning capabilities in large language model (LLM) agents. MedAgentGym comprises 72,413 task instances across 129 categories derived from 12 authentic real-world biomedical scenarios. Tasks are encapsulated within executable sandbox environments, each featuring detailed task specifications, interactive feedback mechanisms, verifiable ground truth annotations, and scalable training trajectory generation. Extensive benchmarking of 29 LLMs reveals substantial performance disparities in biomedical data science between commercial and open-source LLMs. Leveraging efficient multi-threaded and multi-turn trajectory sampling in MedAgentGym, Med-Copilot achieves performance gains of +43.02% and +45.28% from offline and online reinforcement learning, respectively, demonstrating…
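The abstract's idea of multi-turn, multi-threaded trajectory sampling against a verifiable ground truth can be illustrated with a minimal sketch. This is not MedAgentGym's implementation or API: `ToyTask`, `run_in_sandbox`, `toy_policy`, `rollout`, and `sample_trajectories` are all hypothetical names, the "sandbox" is a plain in-process `exec` rather than an isolated container, and the policy is a hard-coded stand-in for an LLM agent.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class ToyTask:
    """Hypothetical task: a prompt plus a verifiable ground-truth answer."""
    prompt: str
    ground_truth: int


@dataclass
class Trajectory:
    task: ToyTask
    turns: List[Tuple[str, str]] = field(default_factory=list)  # (code, feedback)
    success: bool = False


def run_in_sandbox(code: str) -> Tuple[bool, object]:
    """Stand-in for a real sandbox (NOT isolated): run code, read `answer`."""
    scope: dict = {}
    try:
        exec(code, {}, scope)
        return True, scope.get("answer")
    except Exception as exc:
        return False, f"{type(exc).__name__}: {exc}"


def toy_policy(task: ToyTask, history: List[Tuple[str, str]]) -> str:
    """Hard-coded stand-in for an LLM: buggy first attempt, fixed on retry."""
    if not history:
        return "answer = total([1, 2, 3])"  # `total` is undefined on purpose
    return "answer = sum([1, 2, 3])"


def rollout(task: ToyTask, policy: Callable, max_turns: int = 3) -> Trajectory:
    """Multi-turn loop: propose code, execute it, feed the result back."""
    traj = Trajectory(task)
    for _ in range(max_turns):
        code = policy(task, traj.turns)
        ok, result = run_in_sandbox(code)
        if ok and result == task.ground_truth:
            traj.turns.append((code, "correct"))
            traj.success = True
            break
        feedback = f"wrong answer: {result!r}" if ok else str(result)
        traj.turns.append((code, feedback))
    return traj


def sample_trajectories(tasks: List[ToyTask], policy: Callable,
                        workers: int = 4) -> List[Trajectory]:
    """Roll out many tasks concurrently, one thread per in-flight task."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda t: rollout(t, policy), tasks))


if __name__ == "__main__":
    tasks = [ToyTask("Sum the values 1, 2, 3.", 6) for _ in range(3)]
    for traj in sample_trajectories(tasks, toy_policy):
        print(traj.success, [feedback for _, feedback in traj.turns])
```

In this toy setup each trajectory records the failed first attempt together with its error message, which is the kind of interactive feedback a later turn can condition on. Successful trajectories could then be kept as training data, though how the paper actually filters or weights them is not stated in this excerpt.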

Peer Reviews

Decision · ICLR 2026 Oral

Reviewer 01 · Rating 8 · Confidence 4

Strengths

1. Solves an important problem and provides a comprehensive benchmark on real-world medical tasks.
2. Very rigorous evals across many tasks and many models.
3. Strong performance gains of Med-Copilot on the environment.

Weaknesses

- Only execution-based evals; no assessment of intermediate reasoning and steps, which is vital in medicine, where the trajectory matters as much as the solution.
- Big OOD drops on the external dataset are unexplained (needs more validation and digging). Maybe things are just overfit?

Reviewer 02 · Rating 8 · Confidence 2

Strengths

* The paper is written to a good standard and certainly looks publication-ready.
* The consideration of open-source LLMs for the paper's setting of biomedical data science is important, as a large amount of data will be under stringent data privacy rules. Many similar papers in this field do not consider this.
* Good contribution over prior literature, encapsulating the majority of tasks that I believe would be applicable for biomedical data science.
* Extensive benchmarking of

Weaknesses

* I have a feeling that the title and the name given to the training environment are slightly overstepping and too generalized. Perhaps 'BioMedAgentGym' is more suitable.
* Since the paper is releasing a training environment for the practical, real-world use of biomedical data science, I would like to see some discussion of the implications of this and recommendations to users (please see questions below).
* I cannot see many weaknesses, though I am not familiar with the field of biomedical res

Reviewer 03 · Rating 4 · Confidence 3

Strengths

1. MedAgentGym aggregates an exceptionally broad and diverse set of medical code-centric tasks.
2. The environment is built around reproducible, interactive Docker sandboxes that allow code execution, error handling, debugging, and dynamic dependency installation, addressing reproducibility and privacy.
3. Med-Copilot exhibits strong improvements over baselines, with RL strategies yielding notable boosts and ablation studies clarifying the contributions.

Weaknesses

1. MedAgentGym is constructed by integrating 12 existing datasets. Although it provides a division between training and test sets, the model is exposed to the task types and data patterns from these datasets during training. Therefore, its strong performance on the internal test set may partially result from memorization of specific task patterns or overfitting, rather than genuinely acquiring a universal biomedical code reasoning capability.
2. While integrating these components into a large-sc

Taxonomy

Topics: Biomedical Text Mining and Ontologies · Natural Language Processing Techniques · Topic Modeling