# MedGR$^2$: Breaking the Data Barrier for Medical Reasoning via Generative Reward Learning

**Authors:** Weihai Zhi, Jiayan Guo, Shangyang Li

arXiv: 2508.20549 · 2025-12-09

## TL;DR

MedGR$^2$ introduces a self-improving framework that generates high-quality medical data to enhance training and reinforcement learning, significantly improving medical reasoning models' generalization with less reliance on scarce expert annotations.

## Contribution

The paper proposes MedGR$^2$, a generative reward learning framework that co-develops a data generator and a reward model to produce multi-modal medical data for supervised fine-tuning and reinforcement learning of medical VLMs.

## Key findings

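- SFT on MedGR$^2$-generated data surpasses baselines trained on large-scale, human-curated datasets.
- Using the same data for RL via Group Relative Policy Optimization (GRPO) yields state-of-the-art cross-modality and cross-task generalization, outperforming specialized RL-based methods.
- A compact model trained with MedGR$^2$ is competitive with foundation models that have over 10 times more parameters.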

## Abstract

The application of Vision-Language Models (VLMs) in medicine is critically hampered by the scarcity of high-quality, expert-annotated data. Supervised Fine-Tuning (SFT) on existing datasets often leads to poor generalization on unseen modalities and tasks, while Reinforcement Learning (RL), a promising alternative, is stymied by the lack of reliable reward signals in this data-scarce domain. To break this impasse, we introduce Generative Reward Learning for Medical Reasoning (MedGR$^2$), a novel framework that creates a self-improving virtuous cycle. MedGR$^2$ co-develops a data generator and a reward model, enabling the automated, continuous creation of high-quality, multi-modal medical data that serves as a superior training source for both SFT and RL. Our experiments demonstrate that SFT with MedGR$^2$-produced data already surpasses baselines trained on large-scale, human-curated datasets. Crucially, when leveraging this data for RL via Group Relative Policy Optimization (GRPO), our model achieves state-of-the-art cross-modality and cross-task generalization, significantly outperforming specialized RL-based methods. Furthermore, our compact model, empowered by MedGR$^2$, achieves performance competitive with foundation models possessing over 10 times more parameters. MedGR$^2$ presents a new paradigm for data-efficient learning in high-stakes domains, transforming the problem from data scarcity to data generation and unlocking the full potential of RL for building truly generalizable medical AI.
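For readers unfamiliar with GRPO, the core idea is that each sampled response is scored relative to the other responses drawn for the same prompt, rather than against a learned value baseline. The snippet below is a minimal sketch of that group-relative normalization only. It is not the paper's implementation: the function name, the `eps` constant, the use of the population standard deviation, and the example reward values are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    `eps` is an assumed small constant that avoids division by zero
    when every response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations may differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical reward-model scores for four responses to one prompt.
rewards = [0.2, 0.9, 0.5, 0.4]
advantages = group_relative_advantages(rewards)
```

Responses scored above the group mean receive positive advantages and are reinforced; those below the mean receive negative advantages. When all responses in a group get the same reward, every advantage is zero and the group contributes no learning signal.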

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2508.20549/full.md

## Figures

9 figures with captions in the complete paper: https://tomesphere.com/paper/2508.20549/full.md

---
Source: https://tomesphere.com/paper/2508.20549