MedS$^3$: Towards Medical Slow Thinking with Self-Evolved Soft Dual-sided Process Supervision
Shuyang Jiang, Yusheng Liao, Zhe Chen, Ya Zhang, Yanfeng Wang, Yu Wang

TL;DR
MedS3 is a novel self-evolving framework that enhances small medical language models with robust, fine-grained reasoning capabilities through reinforcement learning and a dual process reward system, improving accuracy and reasoning fidelity.
Contribution
The paper introduces MedS3, a self-evolving, rule-verifiable reasoning framework for medical language models, utilizing Monte Carlo Tree Search and a dual process reward model for improved clinical reasoning.
Findings
Outperforms previous state-of-the-art medical models by +6.45 accuracy points.
Surpasses 32B-scale general reasoning models by +8.57 points.
Achieves robust and faithful reasoning behavior in medical tasks.
Abstract
Medical language models face critical barriers to real-world clinical reasoning applications. Mainstream efforts fall short in task coverage, lack fine-grained supervision for intermediate reasoning steps, and rely on proprietary systems, so they remain far from a versatile, credible, and efficient language model for clinical reasoning. To this end, we propose MedS3, a self-evolving framework that imparts robust reasoning capabilities to small, deployable models. Starting with 8,000 curated instances sampled via a curriculum strategy across five medical domains and 16 datasets, we use a small base policy model to conduct Monte Carlo Tree Search (MCTS) for constructing rule-verifiable reasoning trajectories. Self-explored reasoning trajectories ranked by node values are used to bootstrap the policy model via reinforcement fine-tuning and preference learning. Moreover, we…
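The abstract's idea of ranking self-explored trajectories by node value can be illustrated with a minimal sketch. It is not the authors' implementation: the `Node` class, the exact-match `rule_reward`, the mean-value backup, and the sibling-based `preference_pairs` are all assumptions made for illustration, and the search itself (selection, expansion, rollouts by a policy model) is omitted.

```python
from typing import List, Optional, Tuple


class Node:
    """One reasoning step in a search tree (hypothetical structure)."""

    def __init__(self, step: str, parent: Optional["Node"] = None) -> None:
        self.step = step
        self.parent = parent
        self.children: List["Node"] = []
        self.visits = 0
        self.value_sum = 0.0

    @property
    def value(self) -> float:
        # Mean reward of all rollouts that passed through this node.
        return self.value_sum / self.visits if self.visits else 0.0

    def add_child(self, step: str) -> "Node":
        child = Node(step, parent=self)
        self.children.append(child)
        return child


def rule_reward(predicted: str, gold: str) -> float:
    """Assumed rule-based check: exact match on the final answer."""
    return 1.0 if predicted.strip().lower() == gold.strip().lower() else 0.0


def backpropagate(leaf: Node, reward: float) -> None:
    """Add the terminal reward to every node on the path back to the root."""
    node: Optional[Node] = leaf
    while node is not None:
        node.visits += 1
        node.value_sum += reward
        node = node.parent


def best_trajectory(root: Node) -> List[str]:
    """Follow the highest-valued child at each level."""
    path: List[str] = []
    node = root
    while node.children:
        node = max(node.children, key=lambda c: c.value)
        path.append(node.step)
    return path


def preference_pairs(root: Node) -> List[Tuple[str, str]]:
    """Pair the best and worst visited sibling wherever their values differ."""
    pairs: List[Tuple[str, str]] = []
    stack = [root]
    while stack:
        node = stack.pop()
        visited = [c for c in node.children if c.visits > 0]
        if len(visited) >= 2:
            hi = max(visited, key=lambda c: c.value)
            lo = min(visited, key=lambda c: c.value)
            if hi.value > lo.value:
                pairs.append((hi.step, lo.step))
        stack.extend(node.children)
    return pairs


# Toy tree with placeholder steps; the gold answer is option "B".
root = Node("question")
step_a = root.add_child("step A")
step_b = root.add_child("step B")
leaves = [
    (step_a.add_child("answer: B"), "B"),
    (step_a.add_child("answer: C"), "C"),
    (step_b.add_child("answer: C"), "C"),
]
for leaf, answer in leaves:
    backpropagate(leaf, rule_reward(answer, "B"))

chosen = best_trajectory(root)
pairs = preference_pairs(root)
```

In this sketch the highest-valued path would serve as a fine-tuning target and the sibling pairs as chosen/rejected examples for preference learning. How the paper actually scores nodes and builds its training data is not stated in the text above.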
Taxonomy
Topics: Topic Modeling
Methods: Balanced Selection
