When Distance Distracts: Representation Distance Bias in BT-Loss for Reward Models
Tong Xie, Andrew Bai, Yuanhao Ban, Yunqi Hong, Haoyu Li, Cho-Jui Hsieh

TL;DR
This paper identifies a bias in the BT-loss used for reward modeling in LLMs caused by representation distance, and proposes NormBT, a normalization method that improves reward model performance by mitigating this bias.
Contribution
The authors analyze the gradient behavior of BT-loss, reveal the impact of representation distance bias, and introduce NormBT, a simple normalization scheme that enhances reward model training.
Findings
NormBT improves reward model performance by over 5% on RewardBench Reasoning.
Representation distance significantly influences gradient magnitude and learning signals.
NormBT consistently outperforms standard BT-loss across various models and datasets.
Abstract
Reward models are central to Large Language Model (LLM) alignment within the framework of RLHF. The standard objective used in reward modeling is the Bradley-Terry (BT) loss, which learns from pairwise data consisting of chosen and rejected responses. In this work, we analyze the per-sample gradient of BT-loss and show spurious learning signals due to representation distance. In particular, BT gradient norm scales with two distinct components: (1) prediction error, reflected by the difference in predicted rewards between chosen and rejected responses, and critically, (2) representation distance between the pair measured in the output space of the final layer. While the first term captures the intended training signal, the second term can significantly impact the update magnitude and misalign learning. Specifically, pairs with small representation distance often receive vanishingly weak…
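The two-factor scaling described in the abstract can be checked by hand in a simplified setting. The sketch below is an illustration only, not the paper's derivation: it assumes a linear score head r = w^T h and looks only at the gradient with respect to the head weights w, where the BT-loss gradient norm factors exactly into (1 - sigmoid(r_chosen - r_rejected)) times the Euclidean distance between the two final-layer representations. The function names and the toy vectors are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_norm(v):
    return math.sqrt(dot(v, v))

def bt_head_gradient(w, h_chosen, h_rejected):
    """Gradient of -log(sigmoid(r_c - r_r)) w.r.t. w, assuming r = w^T h.

    Returns (gradient, prediction-error factor, representation distance).
    """
    diff = [c - r for c, r in zip(h_chosen, h_rejected)]
    margin = dot(w, diff)              # r_chosen - r_rejected
    error = 1.0 - sigmoid(margin)      # prediction-error factor
    grad = [-error * d for d in diff]  # d(loss)/dw
    return grad, error, l2_norm(diff)

# Toy values (assumptions for illustration, not data from the paper).
w = [0.5, -0.25, 1.0]
h_chosen = [1.0, 2.0, 0.5]
far_rejected = [-1.0, 0.0, -0.5]   # far from the chosen representation
near_rejected = [0.99, 2.0, 0.49]  # almost identical to the chosen one

for name, h_rej in [("far", far_rejected), ("near", near_rejected)]:
    grad, error, dist = bt_head_gradient(w, h_chosen, h_rej)
    print(f"{name}: error={error:.4f} distance={dist:.4f} "
          f"grad_norm={l2_norm(grad):.4f}")
```

In this toy setup the near pair has the larger prediction-error factor yet the smaller gradient norm, because the distance factor shrinks the update. Dividing the gradient norm by the distance would leave only the prediction-error factor; whether NormBT normalizes in exactly this way is not stated in the text above.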
Peer Reviews
Decision · Submitted to ICLR 2026
1. The paper presents a mathematical decomposition of the Bradley-Terry model in the reward modeling context. 2. While occasionally underperforming compared to the baselines, NormBT generally demonstrates strong performance on the benchmark. 3. The post-hoc analysis on small-margin items in Figure 4 aligns the motivation and empirical consequences, making the method more convincing.
Despite the strengths, the paper can be improved with more empirical support to demonstrate the practical advantage of NormBT. 1. **Reward modeling baselines**: The authors compare NormBT against several variants of BT loss in Section 3. Given that the main objective of this paper is to develop a reward modeling algorithm that effectively captures true preferences in the data, the baselines need not be limited to BT loss variants. Specifically, a few points on GRM [1] need to be clarified by th
1. The paper tackles an important problem, the impact of representation distance on alignment; Figure 1 is an important illustration. 2. The gradient norm analysis is crisp, and the norm of the gradient is neatly shown to depend on prediction error and representation distance. 3. The final objective for NormBT is quite intuitive and easy to understand.
1. Most direct alignment methods (DAAs) like DPO [1], SimPO [2] and AlphaPO [3] skip the reward modeling stage. DAAs are the most popular ways to align LLMs these days, making the paper a bit limited in its impact. 2. Reward models can easily get over-optimized. The paper lacks ablations and experiments discussing the careful optimization of RMs during training. 3. The baselines are not explained in detail. 4. The experiments are not trustworthy because there are no error bars, very small models
1. Clear identification of a structural bias in BT-loss: The decomposition of gradient norm (Eq. 7) elegantly shows that update magnitude scales with both prediction error and representation distance. This theoretical insight provides a solid foundation for understanding how BT-based reward models may fail to learn from fine-grained preference pairs, especially in reasoning-oriented data. 2. Simple, lightweight correction: NormBT is a “drop-in” modification requiring no architectural change. By
1. Limited generality of theoretical analysis: The derivation in Eq. 7 assumes a linear score head $r(x, y) = w_s^T h_\phi(x, y)$ and a Lipschitz-smooth embedding map. This simplification ignores the nonlinear components of modern reward models (layer normalization, activation scaling, or residual mixing). The paper does not test whether the same coupling holds under non-linear heads or multi-layer scoring networks. Therefore, the claimed “representation distance bias” may be an artifact of this
Taxonomy
Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Natural Language Processing Techniques
