# Med-RewardBench: Benchmarking Reward Models and Judges for Medical Multimodal Large Language Models

**Authors:** Meidan Ding, Jipeng Zhang, Wenxuan Wang, Cheng-Yi Li, Wei-Chieh Fang, Hsin-Yu Wu, Haiqin Zhong, Wenting Chen, Linlin Shen

arXiv: 2508.21430 · 2025-09-01

## TL;DR

Med-RewardBench is a benchmark for evaluating reward models and judges for medical multimodal large language models. It comprises 1,026 expert-annotated multimodal cases spanning 13 organ systems and 8 clinical departments, assessed along six clinically critical dimensions.

## Contribution

This work introduces the first benchmark dedicated to medical reward models (MRMs) and judges. It contributes an expert-annotated multimodal dataset built through a three-step process, an evaluation of 32 MLLMs against expert judgment, and fine-tuned baseline models.

## Key findings

- Across the 32 evaluated MLLMs (open-source, proprietary, and medical-specific), aligning model judgments with expert annotations remains a substantial challenge.
- The baseline models developed by the authors show substantial performance improvements through fine-tuning.
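
To make "alignment with expert judgment" concrete, the sketch below computes a per-dimension agreement rate between a judge model and expert annotators. It is illustrative only: this summary does not state the paper's metric or data format, so the pairwise A/B comparison layout, the `Comparison` record, and the `agreement_by_dimension` helper are all assumptions. The two dimension labels echo the examples named in the abstract (diagnostic accuracy and clinical relevance); the remaining four dimensions are not listed here and are left out.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Comparison:
    """One pairwise comparison of two candidate responses (hypothetical format)."""
    dimension: str      # evaluation dimension label (assumed naming)
    expert_choice: str  # response preferred by the expert annotator, "A" or "B"
    judge_choice: str   # response preferred by the judge model, "A" or "B"


def agreement_by_dimension(comparisons):
    """Return the fraction of comparisons where judge and expert agree, per dimension."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for c in comparisons:
        totals[c.dimension] += 1
        if c.judge_choice == c.expert_choice:
            hits[c.dimension] += 1
    return {dim: hits[dim] / totals[dim] for dim in totals}


# Toy data, invented purely for illustration (not taken from the benchmark).
sample = [
    Comparison("diagnostic_accuracy", "A", "A"),
    Comparison("diagnostic_accuracy", "B", "A"),
    Comparison("clinical_relevance", "B", "B"),
    Comparison("clinical_relevance", "A", "A"),
]

scores = agreement_by_dimension(sample)
for dim, rate in scores.items():
    print(f"{dim}: {rate:.2f}")
```

Reporting agreement per dimension rather than as a single pooled number keeps a judge that is reliable on one dimension from masking weakness on another.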

## Abstract

Multimodal large language models (MLLMs) hold significant potential in medical applications, including disease diagnosis and clinical decision-making. However, these tasks require highly accurate, context-sensitive, and professionally aligned responses, making reliable reward models and judges critical. Despite their importance, medical reward models (MRMs) and judges remain underexplored, with no dedicated benchmarks addressing clinical requirements. Existing benchmarks focus on general MLLM capabilities or evaluate models as solvers, neglecting essential evaluation dimensions like diagnostic accuracy and clinical relevance. To address this, we introduce Med-RewardBench, the first benchmark specifically designed to evaluate MRMs and judges in medical scenarios. Med-RewardBench features a multimodal dataset spanning 13 organ systems and 8 clinical departments, with 1,026 expert-annotated cases. A rigorous three-step process ensures high-quality evaluation data across six clinically critical dimensions. We evaluate 32 state-of-the-art MLLMs, including open-source, proprietary, and medical-specific models, revealing substantial challenges in aligning outputs with expert judgment. Additionally, we develop baseline models that demonstrate substantial performance improvements through fine-tuning.

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2508.21430/full.md

## Figures

10 figures with captions in the complete paper: https://tomesphere.com/paper/2508.21430/full.md

## References

60 references — full list in the complete paper: https://tomesphere.com/paper/2508.21430/full.md

---
Source: https://tomesphere.com/paper/2508.21430