Benchmark Success, Clinical Failure: When Reinforcement Learning Optimizes for Benchmarks, Not Patients

Armin Berger; Manuela Bergau; Helen Schneider; Saad Ahmad; Tom Anglim Lagones; Gianluca Brugnara; Martha Foltyn-Dumitru; Kai Schlamp; Philipp Vollmuth; Rafet Sifa

arXiv:2512.23090 · cs.AI · January 5, 2026

TL;DR

This paper examines how reinforcement learning improves benchmark performance of medical vision-language models but can harm their ability to generalize across different datasets, highlighting a need for more robust training methods.

Contribution

The paper introduces ChexReason, a vision-language model trained on limited data with supervised fine-tuning (SFT) followed by GRPO, and analyzes the impact of RL on in-distribution versus cross-dataset performance in medical imaging.

Findings

01. RL improves in-distribution performance significantly.

02. RL degrades cross-dataset transferability.

03. Supervised fine-tuning may outperform RL for robustness.

Abstract

Recent Reinforcement Learning (RL) advances for Large Language Models (LLMs) have improved reasoning tasks, yet their resource-constrained application to medical imaging remains underexplored. We introduce ChexReason, a vision-language model trained via R1-style methodology (SFT followed by GRPO) using only 2,000 SFT samples, 1,000 RL samples, and a single A100 GPU. Evaluations on CheXpert and NIH benchmarks reveal a fundamental tension: GRPO recovers in-distribution performance (23% improvement on CheXpert, macro-F1 = 0.346) but degrades cross-dataset transferability (19% drop on NIH). This mirrors high-resource models like NV-Reason-CXR-3B, suggesting the issue stems from the RL paradigm rather than scale. We identify a generalization paradox where the SFT checkpoint uniquely improves on NIH before optimization, indicating teacher-guided reasoning captures more institution-agnostic…
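The abstract reports results as macro-F1 and as percentage changes between checkpoints. As a rough illustration only, the sketch below computes macro-averaged F1 for multi-label predictions and a relative change between two scores. The function names `macro_f1` and `relative_change` and the toy labels are hypothetical, and treating the reported percentages as relative (rather than absolute) changes is an assumption that the abstract does not confirm.

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1 for multi-label binary data.

    y_true, y_pred: lists of equal-length 0/1 label vectors,
    one vector per image, one position per finding.
    """
    n_labels = len(y_true[0])
    per_label = []
    for j in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        denom = 2 * tp + fp + fn
        # Convention assumed here: F1 is 0 when a label has no positives
        # in either the ground truth or the predictions.
        per_label.append(2 * tp / denom if denom else 0.0)
    # Macro averaging weights every label equally, regardless of prevalence.
    return sum(per_label) / n_labels


def relative_change(before, after):
    """Relative change between two scores (assumed interpretation)."""
    return (after - before) / before


# Toy data: three images, two findings (illustrative, not from the paper).
toy_true = [[1, 0], [1, 1], [0, 1]]
toy_pred = [[1, 0], [0, 1], [0, 0]]
toy_score = macro_f1(toy_true, toy_pred)
```

Because macro averaging gives rare findings the same weight as common ones, a shift in label prevalence between datasets can move the score considerably, which is one reason in-distribution and cross-dataset numbers are worth reporting side by side.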

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Machine Learning in Healthcare · Explainable Artificial Intelligence (XAI)