On the Robustness of Reward Models for Language Model Alignment

Jiwoo Hong; Noah Lee; Eunki Kim; Guijin Son; Woojin Chung; Aman Gupta; Shao Tang; James Thorne

arXiv:2505.07271·cs.CL·May 13, 2025

TL;DR

This paper investigates over-optimization issues in reward models for language model alignment, proposing a regularization method to improve robustness and enhance alignment performance in RLHF.

Contribution

The paper identifies excessive dispersion of hidden state norms as the main cause of over-optimization and introduces batch-wise sum-to-zero regularization (BSR) to improve reward model (RM) robustness.

Findings

1. BSR improves robustness in over-optimization scenarios.
2. Robust RMs better align policies to gold preferences.
3. Applying BSR enhances performance on large-scale preference prediction.

Abstract

The Bradley-Terry (BT) model is widely practiced in reward modeling for reinforcement learning with human feedback (RLHF). Despite its effectiveness, reward models (RMs) trained with BT model loss are prone to over-optimization, losing generalizability to unseen input distributions. In this paper, we study the cause of over-optimization in RM training and its downstream effects on the RLHF procedure, accentuating the importance of distributional robustness of RMs in unseen data. First, we show that the excessive dispersion of hidden state norms is the main source of over-optimization. Then, we propose batch-wise sum-to-zero regularization (BSR) to enforce zero-centered reward sum per batch, constraining the rewards with extreme magnitudes. We assess the impact of BSR in improving robustness in RMs through four scenarios of over-optimization, where BSR consistently manifests better…
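The two ingredients named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the exact form of the BSR term is not given in the text above, so the penalty here is assumed to be the squared mean of all rewards in the batch, and the function names and the weight `lam` are hypothetical.

```python
import math


def bt_loss(chosen, rejected):
    """Mean Bradley-Terry loss: -log sigmoid(r_chosen - r_rejected)."""
    losses = []
    for r_c, r_r in zip(chosen, rejected):
        d = r_c - r_r
        # Numerically stable softplus(-d), which equals -log sigmoid(d).
        losses.append(max(-d, 0.0) + math.log1p(math.exp(-abs(d))))
    return sum(losses) / len(losses)


def sum_to_zero_penalty(chosen, rejected):
    """Assumed BSR-style term: squared mean of every reward in the batch."""
    rewards = list(chosen) + list(rejected)
    mean = sum(rewards) / len(rewards)
    return mean ** 2


def bt_loss_with_bsr(chosen, rejected, lam=0.01):
    """BT loss plus the assumed penalty; `lam` is a hypothetical weight."""
    return bt_loss(chosen, rejected) + lam * sum_to_zero_penalty(chosen, rejected)


if __name__ == "__main__":
    # Two batches with identical reward gaps but different offsets.
    centered = ([1.0, 2.0], [-1.0, 0.0])
    shifted = ([5.0, 6.0], [3.0, 4.0])
    for name, (c, r) in (("centered", centered), ("shifted", shifted)):
        print(name, bt_loss(c, r), sum_to_zero_penalty(c, r))
```

The BT loss depends only on reward differences, so adding a constant to every reward leaves it unchanged. In this sketch the penalty is the only term that responds to such an offset, which is one way to read "zero-centered reward sum per batch".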

Taxonomy

Topics: Topic Modeling · Reinforcement Learning in Robotics · Natural Language Processing Techniques

Methods: ALIGN