Outcome Accuracy is Not Enough: Aligning the Reasoning Process of Reward Models

Binghai Wang; Yantao Liu; Yuxuan Liu; Tianyi Tang; Shenzhi Wang; Chang Gao; Chujie Zheng; Yichang Zhang; Le Yu; Shixuan Liu; Tao Gui; Qi Zhang; Xuanjing Huang; Bowen Yu; Fei Huang; Junyang Lin

arXiv:2602.04649·cs.CL·February 5, 2026

TL;DR

This paper introduces Rationale Consistency, a new metric for aligning reward models' reasoning with human judgment, and demonstrates its effectiveness in improving model performance and avoiding deceptive alignment.

Contribution

The paper proposes Rationale Consistency as a fine-grained alignment metric and combines it with outcome accuracy to enhance reward model training and evaluation.

Findings

01 Rationale Consistency effectively detects deceptive alignment.

02 Hybrid training improves performance on RM-Bench and JudgeBench.

03 The method enhances RLHF outcomes, especially in creative tasks.

Abstract

Generative Reward Models (GenRMs) and LLM-as-a-Judge exhibit deceptive alignment by producing correct judgments for incorrect reasons, as they are trained and evaluated to prioritize Outcome Accuracy, which undermines their ability to generalize during RLHF. We introduce Rationale Consistency, a fine-grained metric that quantifies the alignment between the model's reasoning process and human judgment. Our evaluation of frontier models reveals that rationale consistency effectively discriminates among state-of-the-art models and detects deceptive alignment, while outcome accuracy falls short in both respects. To mitigate this gap, we introduce a hybrid signal that combines rationale consistency with outcome accuracy for GenRM training. Our training method achieves state-of-the-art performance on RM-Bench (87.1%) and JudgeBench (82%), surpassing outcome-only baselines by an average of 5%.…
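
The abstract does not say how rationale consistency is computed or how it is combined with outcome accuracy, so the following is only a minimal sketch under stated assumptions. It assumes each comparison comes with a list of human-annotated reasons, that rationale consistency is the fraction of those reasons the model's rationale also covers, and that the hybrid signal is a weighted average. The names `Judgment`, `rationale_consistency`, `hybrid_signal`, and the `weight` parameter are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Judgment:
    """One reward-model verdict on a pairwise comparison (hypothetical structure)."""
    predicted_winner: str  # response the model preferred, e.g. "A" or "B"
    matched_reasons: int   # human-annotated reasons the model's rationale also covers
    total_reasons: int     # human-annotated reasons available for this comparison


def outcome_accuracy(j: Judgment, human_winner: str) -> float:
    """1.0 if the model picked the same response as the human annotator, else 0.0."""
    return 1.0 if j.predicted_winner == human_winner else 0.0


def rationale_consistency(j: Judgment) -> float:
    """Assumed definition: share of human reasons recovered by the model's rationale."""
    if j.total_reasons == 0:
        return 0.0
    return j.matched_reasons / j.total_reasons


def hybrid_signal(j: Judgment, human_winner: str, weight: float = 0.5) -> float:
    """Assumed combination: weighted average of outcome and rationale scores."""
    return (1.0 - weight) * outcome_accuracy(j, human_winner) + weight * rationale_consistency(j)


# A "right answer, wrong reasons" judgment versus one whose rationale matches humans.
right_for_wrong_reasons = Judgment(predicted_winner="A", matched_reasons=0, total_reasons=4)
right_for_right_reasons = Judgment(predicted_winner="A", matched_reasons=3, total_reasons=4)

print(outcome_accuracy(right_for_wrong_reasons, "A"))  # 1.0
print(outcome_accuracy(right_for_right_reasons, "A"))  # 1.0
print(hybrid_signal(right_for_wrong_reasons, "A"))     # 0.5
print(hybrid_signal(right_for_right_reasons, "A"))     # 0.875
```

Under these assumptions, outcome accuracy alone scores both judgments identically, while the hybrid signal separates them. This is the kind of gap the abstract attributes to an outcome-only objective.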

Code & Models

Datasets

Qwen/RationaleRM
dataset · 400 downloads

Taxonomy

Topics: Topic Modeling · Mental Health via Writing · Multimodal Machine Learning Applications