TL;DR
This paper introduces a comprehensive RL environment for medical AI agents, analyzes challenges in multi-turn clinical reasoning, and proposes TT-OPD, a self-distillation method that improves training stability and performance.
Contribution
It presents a new multi-domain clinical environment, analyzes multi-turn RL challenges, and introduces TT-OPD, a self-distillation framework that enhances training stability and effectiveness.
Findings
Agentic multi-turn RL degrades into verbose single-turn interactions.
TT-OPD improves training stability and performance across benchmarks.
Vanilla GRPO achieves high accuracy but suffers from instability.
Abstract
Clinical reasoning demands multi-step interactions -- gathering patient history, ordering tests, interpreting results, and making safe treatment decisions -- yet a unified training environment that provides the breadth of clinical domains and specialized tools needed to train generalizable medical AI agents through reinforcement learning remains elusive. We present a comprehensive empirical study of multi-turn agentic RL for medical AI, built on \gym{}, a gymnasium-compatible environment spanning 10 clinical domains with 3.6K+ tasks, 135 domain-specific tools, and a knowledge base of 828K medical passages. Our analysis reveals that agentic multi-turn structure degrades into verbose single-turn monologues, characterized by monotonic length explosion and a simultaneous erosion of tool-use frequency. We characterize how this collapse, alongside distillation instability, stems from the misalignment of…
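The collapse pattern described in the abstract (responses growing longer while tool calls become rarer) can be monitored with simple rollout statistics. The following is a minimal sketch, not the paper's own metric or code: the `Turn` record, the function names `rollout_stats` and `looks_collapsed`, the `window` parameter, and the use of whitespace-separated words as a stand-in for tokens are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Turn:
    """One agent turn in a rollout (hypothetical record, not from the paper)."""
    text: str         # agent output for this turn
    used_tool: bool   # whether this turn issued a tool call


def rollout_stats(rollouts: Sequence[Sequence[Turn]]) -> dict:
    """Summarize a batch of rollouts collected at one training checkpoint."""
    n_turns = sum(len(r) for r in rollouts)
    if n_turns == 0:
        raise ValueError("no turns to summarize")
    # Assumption: whitespace word count approximates token count.
    n_words = sum(len(t.text.split()) for r in rollouts for t in r)
    n_tools = sum(t.used_tool for r in rollouts for t in r)
    return {
        "turns_per_episode": n_turns / len(rollouts),
        "words_per_turn": n_words / n_turns,
        "tool_rate": n_tools / n_turns,
    }


def looks_collapsed(history: List[dict], window: int = 3) -> bool:
    """Flag the last `window` checkpoints if length strictly rises
    while tool-call rate strictly falls across all of them."""
    if len(history) < window:
        return False
    recent = history[-window:]
    pairs = list(zip(recent, recent[1:]))
    longer = all(b["words_per_turn"] > a["words_per_turn"] for a, b in pairs)
    fewer_tools = all(b["tool_rate"] < a["tool_rate"] for a, b in pairs)
    return longer and fewer_tools
```

Calling `rollout_stats` on each checkpoint's rollouts and appending the result to a list gives a history that `looks_collapsed` can check. A strict monotonic test is deliberately conservative; a noisier training run would probably need smoothing or a trend fit instead.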
