Outcome Accuracy is Not Enough: Aligning the Reasoning Process of Reward Models
Binghai Wang, Yantao Liu, Yuxuan Liu, Tianyi Tang, Shenzhi Wang, Chang Gao, Chujie Zheng, Yichang Zhang, Le Yu, Shixuan Liu, Tao Gui, Qi Zhang, Xuanjing Huang, Bowen Yu, Fei Huang, Junyang Lin

TL;DR
This paper introduces Rationale Consistency, a new metric for aligning reward models' reasoning with human judgment, and demonstrates its effectiveness in improving model performance and avoiding deceptive alignment.
Contribution
The paper proposes Rationale Consistency as a fine-grained alignment metric and combines it with outcome accuracy to enhance reward model training and evaluation.
Findings
Rationale Consistency effectively detects deceptive alignment.
Hybrid training improves performance on RM-Bench and JudgeBench.
The method improves RLHF outcomes, especially on creative tasks.
Abstract
Generative Reward Models (GenRMs) and LLM-as-a-Judge exhibit deceptive alignment by producing correct judgments for incorrect reasons, as they are trained and evaluated to prioritize Outcome Accuracy, which undermines their ability to generalize during RLHF. We introduce Rationale Consistency, a fine-grained metric that quantifies the alignment between the model's reasoning process and human judgment. Our evaluation of frontier models reveals that rationale consistency effectively discriminates among state-of-the-art models and detects deceptive alignment, while outcome accuracy falls short in both respects. To mitigate this gap, we introduce a hybrid signal that combines rationale consistency with outcome accuracy for GenRM training. Our training method achieves state-of-the-art performance on RM-Bench (87.1%) and JudgeBench (82%), surpassing outcome-only baselines by an average of 5%.…
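As a rough illustration of the idea in the abstract, the sketch below scores a single judgment on both outcome and rationale, then blends the two. Everything here is an assumption made for illustration and is not taken from the paper: the `Judgment` fields, the definition of rationale consistency as the fraction of human rationale points that the model's reasoning covers, and the convex combination controlled by a hypothetical `weight` parameter. The paper's actual metric and hybrid signal may differ.

```python
from dataclasses import dataclass


@dataclass
class Judgment:
    """One reward-model judgment on a preference pair (hypothetical structure)."""
    predicted_winner: str  # label the model chose, e.g. "A" or "B"
    matched_points: int    # human rationale points covered by the model's reasoning
    total_points: int      # human rationale points annotated for this pair


def outcome_accuracy(judgment: Judgment, gold_winner: str) -> float:
    """Return 1.0 if the predicted winner matches the human label, else 0.0."""
    return 1.0 if judgment.predicted_winner == gold_winner else 0.0


def rationale_consistency(judgment: Judgment) -> float:
    """Assumed proxy: fraction of human rationale points the model's reasoning covers."""
    if judgment.total_points == 0:
        return 0.0
    return judgment.matched_points / judgment.total_points


def hybrid_signal(judgment: Judgment, gold_winner: str, weight: float = 0.5) -> float:
    """Assumed blend: convex combination of outcome accuracy and rationale consistency."""
    outcome = outcome_accuracy(judgment, gold_winner)
    rationale = rationale_consistency(judgment)
    return (1.0 - weight) * outcome + weight * rationale


# Correct verdict reached while covering only 1 of 4 human rationale points.
right_for_wrong_reasons = Judgment(predicted_winner="A", matched_points=1, total_points=4)
# Correct verdict with reasoning that covers all 4 human rationale points.
right_for_right_reasons = Judgment(predicted_winner="A", matched_points=4, total_points=4)

weak_score = hybrid_signal(right_for_wrong_reasons, gold_winner="A")
strong_score = hybrid_signal(right_for_right_reasons, gold_winner="A")
```

Under these assumptions both judgments have an outcome accuracy of 1.0, yet the blended score separates them, which is the kind of distinction an outcome-only signal cannot make.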
Taxonomy
Topics: Topic Modeling · Mental Health via Writing · Multimodal Machine Learning Applications
