Beyond Distillation: Pushing the Limits of Medical LLM Reasoning with Minimalist Rule-Based RL

Che Liu; Haozhe Wang; Jiazhen Pan; Zhongwei Wan; Yong Dai; Fangzhen Lin; Wenjia Bai; Daniel Rueckert; Rossella Arcucci

arXiv:2505.17952·cs.CL·May 26, 2025

TL;DR

This paper introduces AlphaMed, a medical large language model trained with minimalist rule-based reinforcement learning on public multiple-choice QA datasets, without supervised fine-tuning or distilled chain-of-thought data. It reports state-of-the-art results on six medical QA benchmarks.

Contribution

AlphaMed demonstrates that reasoning capability can emerge through reinforcement learning with rule-based rewards alone, outperforming models trained with conventional SFT+RL pipelines.

Findings

1. Minimalist RL effectively induces reasoning without CoT supervision.
2. Dataset informativeness significantly impacts reasoning performance.
3. Challenging benchmarks reveal limitations and the need for better evaluation methods.

Abstract

Improving performance on complex tasks and enabling interpretable decision making in large language models (LLMs), especially for clinical applications, requires effective reasoning. Yet this remains challenging without supervised fine-tuning (SFT) on costly chain-of-thought (CoT) data distilled from closed-source models (e.g., GPT-4o). In this work, we present AlphaMed, the first medical LLM to show that reasoning capability can emerge purely through reinforcement learning (RL), using minimalist rule-based rewards on public multiple-choice QA datasets, without relying on SFT or distilled CoT data. AlphaMed achieves state-of-the-art results on six medical QA benchmarks, outperforming models trained with conventional SFT+RL pipelines. On challenging benchmarks (e.g., MedXpert), AlphaMed even surpasses larger or closed-source models such as DeepSeek-V3-671B and Claude-3.5-Sonnet. To…
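The abstract does not spell out the exact form of the rule-based reward, so the following is only a minimal sketch of what such a reward could look like for multiple-choice QA. It assumes a hypothetical output convention in which the model states its final choice as "Answer: <letter>", and it assigns a binary reward by exact match against the gold option. The function names, the answer format, and the A-J option range are illustrative assumptions, not details taken from the paper.

```python
import re
from typing import Optional

# Assumed (hypothetical) convention: the final choice appears as "Answer: <letter>".
# Options A-J are allowed here only as an illustrative range.
_ANSWER_RE = re.compile(r"answer\s*:\s*\(?([A-J])\b", re.IGNORECASE)


def extract_choice(response: str) -> Optional[str]:
    """Return the last stated option letter in upper case, or None if absent."""
    matches = _ANSWER_RE.findall(response)
    return matches[-1].upper() if matches else None


def rule_based_reward(response: str, gold: str) -> float:
    """Binary reward: 1.0 if the extracted choice equals the gold letter, else 0.0."""
    pred = extract_choice(response)
    if pred is None:
        # No parseable answer is treated the same as a wrong answer in this sketch.
        return 0.0
    return 1.0 if pred == gold.strip().upper() else 0.0


if __name__ == "__main__":
    sample = "The presentation suggests iron deficiency. Answer: C"
    print(rule_based_reward(sample, "C"))
```

Because the reward depends only on the final letter, no chain-of-thought annotation is needed to compute it; any reasoning text before the answer is ignored by the scoring rule. Taking the last match means a response that revises its choice is scored on its final statement.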
