TL;DR
This paper introduces AlphaMed, a medical large language model that achieves state-of-the-art reasoning and question-answering performance through minimalist rule-based reinforcement learning without relying on supervised fine-tuning or chain-of-thought data.
Contribution
AlphaMed demonstrates that reasoning capabilities can emerge solely through reinforcement learning with rule-based rewards, outperforming models trained with conventional pipelines that combine supervised fine-tuning and reinforcement learning.
Findings
Minimalist reinforcement learning (RL) effectively induces reasoning without chain-of-thought (CoT) supervision.
Dataset informativeness significantly impacts reasoning performance.
Challenging benchmarks reveal limitations and the need for better evaluation methods.
Abstract
Improving performance on complex tasks and enabling interpretable decision making in large language models (LLMs), especially for clinical applications, requires effective reasoning. Yet this remains challenging without supervised fine-tuning (SFT) on costly chain-of-thought (CoT) data distilled from closed-source models (e.g., GPT-4o). In this work, we present AlphaMed, the first medical LLM to show that reasoning capability can emerge purely through reinforcement learning (RL), using minimalist rule-based rewards on public multiple-choice QA datasets, without relying on SFT or distilled CoT data. AlphaMed achieves state-of-the-art results on six medical QA benchmarks, outperforming models trained with conventional SFT+RL pipelines. On challenging benchmarks (e.g., MedXpert), AlphaMed even surpasses larger or closed-source models such as DeepSeek-V3-671B and Claude-3.5-Sonnet. To…
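To make the idea of a "minimalist rule-based reward" on multiple-choice QA concrete, here is a small sketch of what such a reward could look like. It is not taken from the paper: the `Answer: X` output format, the option range A-J, the binary 1.0/0.0 scoring, and the names `extract_choice` and `rule_based_reward` are all assumptions made for illustration.

```python
import re
from typing import Optional

# Assumed output convention (hypothetical, not from the paper): the model
# states its final choice as "Answer: X" or "Answer: (X)", with X in A-J.
# The negative lookahead rejects words such as "Answer: Because".
ANSWER_PATTERN = re.compile(r"Answer:\s*\(?([A-J])\)?(?![A-Za-z])")


def extract_choice(completion: str) -> Optional[str]:
    """Return the last stated option letter, or None if none is found."""
    matches = ANSWER_PATTERN.findall(completion)
    return matches[-1] if matches else None


def rule_based_reward(completion: str, gold_choice: str) -> float:
    """Binary reward: 1.0 if the extracted option equals the gold option.

    No learned reward model and no reference reasoning trace are involved;
    only the final option letter is checked against the dataset label.
    """
    predicted = extract_choice(completion)
    return 1.0 if predicted == gold_choice.strip().upper() else 0.0


if __name__ == "__main__":
    sample = "The presentation suggests iron deficiency. Answer: B"
    print(rule_based_reward(sample, "B"))
```

Because the reward depends only on the final option letter, the policy is free to produce whatever intermediate reasoning helps it reach the correct choice, which is the sense in which no CoT supervision is needed. A completion with no parseable answer simply receives zero reward in this sketch.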
