Clinical-R1: Empowering Large Language Models for Faithful and Comprehensive Reasoning with Clinical Objective Relative Policy Optimization

Boyang Gu; Hongjian Zhou; Bradley Max Segal; Jinge Wu; Zeyu Cao; Hantao Zhong; Lei Clifton; Fenglin Liu; David A. Clifton

arXiv:2512.00601 · cs.AI · December 4, 2025

TL;DR

This paper introduces CRPO, a multi-objective reinforcement learning method that improves the faithfulness and comprehensiveness of large language models in clinical reasoning, surpassing previous methods in truthfulness and completeness.

Contribution

We propose CRPO, a scalable, multi-objective, verifiable RL approach that aligns LLMs with clinical reasoning principles without human annotations, demonstrated on a 3B model.

Findings

01. CRPO significantly improves truthfulness and completeness in clinical reasoning tasks.

02. The 3B Clinical-R1 model trained with CRPO outperforms standard methods on benchmarks.

03. CRPO maintains accuracy while enhancing faithfulness and comprehensiveness.

Abstract

Recent advances in large language models (LLMs) have shown strong reasoning capabilities through large-scale pretraining and post-training reinforcement learning, as demonstrated by DeepSeek-R1. However, current post-training methods, such as Group Relative Policy Optimization (GRPO), mainly reward correctness, which is not aligned with the multi-dimensional objectives required in high-stakes fields such as medicine, where reasoning must also be faithful and comprehensive. We introduce Clinical-Objective Relative Policy Optimization (CRPO), a scalable, multi-objective, verifiable reinforcement learning method designed to align LLM post-training with clinical reasoning principles. CRPO integrates rule-based and verifiable reward signals that jointly optimize accuracy, faithfulness, and comprehensiveness without relying on human annotation. To demonstrate its effectiveness, we train…
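The idea of combining several verifiable reward signals with a group-relative baseline can be illustrated with a small sketch. This is not the paper's implementation: the abstract does not give CRPO's reward formulas, so the weighted sum, the equal weights, the [0, 1] score range, the example scores, and the helper names `combined_reward` and `group_relative_advantages` are all assumptions made for illustration. The normalization step follows the commonly described GRPO recipe of standardizing rewards within a group of responses sampled for the same prompt.

```python
from statistics import mean, pstdev


def combined_reward(accuracy, faithfulness, comprehensiveness,
                    weights=(1.0, 1.0, 1.0)):
    """Hypothetical multi-objective reward: a weighted sum of three scores.

    Each score is assumed to lie in [0, 1]; the equal weights are an
    illustrative choice, not values taken from the paper.
    """
    w_acc, w_faith, w_comp = weights
    return (w_acc * accuracy
            + w_faith * faithfulness
            + w_comp * comprehensiveness)


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of sampled responses.

    Each advantage is (reward - group mean) / (group std + eps).
    Population standard deviation is used here as an assumption.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up scores for four responses to the same prompt:
# (accuracy, faithfulness, comprehensiveness)
group = [
    (1.0, 0.9, 0.8),  # correct, well grounded, thorough
    (1.0, 0.4, 0.3),  # correct, but weakly grounded and incomplete
    (0.0, 0.7, 0.6),  # wrong answer with reasonable reasoning
    (0.0, 0.2, 0.1),  # wrong answer with poor reasoning
]

rewards = [combined_reward(*scores) for scores in group]
advantages = group_relative_advantages(rewards)

for scores, adv in zip(group, advantages):
    print(scores, round(adv, 3))
```

Under these assumptions, two responses with the same correctness score receive different advantages when their faithfulness and comprehensiveness scores differ, which is the kind of signal a correctness-only reward cannot provide.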

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Machine Learning in Healthcare · Topic Modeling