MedPRMBench: A Fine-grained Benchmark for Process Reward Models in Medical Reasoning

Lingyan Wu, Xiang Zheng, Weiqi Zhai, Wei Wang, Xuan Ren, Zifan Zhang, Hu Wei, and Bing Zhao

arXiv:2604.17282 · cs.CL · April 21, 2026

TL;DR

MedPRMBench is the first benchmark for evaluating process reward models (PRMs) on medical reasoning, a safety-critical and knowledge-intensive domain that existing PRM benchmarks, focused on general areas such as mathematics, do not cover.

Contribution

The paper introduces a fine-grained medical PRM benchmark with a new severity grading system and reports baseline results that highlight weaknesses in current models.

Findings

01 The Medical PRM baseline achieves a PRMScore of 87.1%, surpassing the other baselines.

02 The benchmark covers 14 error types across three categories, each with severity levels.

03 Evaluation reveals significant gaps in current models' ability to detect errors in medical reasoning.

Abstract

Process-Level Reward Models (PRMs) are essential for guiding complex reasoning in large language models, yet existing PRM benchmarks cover only general domains such as mathematics and fail to address medical reasoning, which is uniquely characterized by safety criticality, knowledge intensity, and diverse error patterns. Without a reliable medical PRM evaluation framework, we cannot quantify models' error detection capabilities in clinical reasoning, leaving their safety in real-world healthcare applications unverified. We propose MedPRMBench, the first process-level reward model benchmark for the medical domain. Built through a three-phase pipeline based on Clinical Reasoning Blueprints (CRBs), MedPRMBench systematically generates high-quality evaluation data from seven medical QA sources, covering 14 fine-grained error types across three categories (Simplicity, Soundness, and…

Code & Models

Models

keval-sha/medgemma-cardiac-training-plan (Hugging Face model)
