Bradley-Terry and Multi-Objective Reward Modeling Are Complementary
Zhiwei Zhang, Hui Liu, Xiaomin Li, Zhenwei Dai, Jingying Zeng, Fali Wang, Minhua Lin, Ramraj Chandradevan, Zhen Li, Chen Luo, Xianfeng Tang, Qi He, Suhang Wang

TL;DR
This paper introduces a unified reward modeling framework that combines Bradley-Terry and multi-objective methods, significantly improving the robustness and performance of reward models used to align language models in out-of-distribution scenarios.
Contribution
The paper proposes a framework that jointly trains a Bradley-Terry reward function and a multi-objective reward function, establishes a theoretical connection between the two, and demonstrates improved robustness and scoring in OOD settings.
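As a rough illustration of what joint training of the two reward functions could look like, the sketch below combines a Bradley-Terry loss on a preference pair with a regression loss on attribute scores. It is an assumption-laden sketch, not the paper's implementation: the linear heads, the plain-list embeddings standing in for a shared backbone output, the weight `lam`, and every function name are hypothetical.

```python
import math

def dot(w, h):
    """Inner product of a head's weight vector with an embedding."""
    return sum(wi * hi for wi, hi in zip(w, h))

def bt_loss(w_bt, h_chosen, h_rejected):
    """Bradley-Terry negative log-likelihood for one preference pair.

    Equals -log(sigmoid(margin)), computed in a numerically stable form.
    """
    margin = dot(w_bt, h_chosen) - dot(w_bt, h_rejected)
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

def mo_loss(W_mo, h, target_scores):
    """Mean squared error between predicted and annotated attribute scores."""
    preds = [dot(w_k, h) for w_k in W_mo]
    return sum((p - t) ** 2 for p, t in zip(preds, target_scores)) / len(preds)

def joint_loss(w_bt, W_mo, pref_pair, attr_example, lam=1.0):
    """Sum of the two losses; `lam` is an assumed trade-off weight.

    The preference pair and the attribute-scored example may come from
    different datasets, since each head only needs its own labels.
    """
    h_chosen, h_rejected = pref_pair
    h, scores = attr_example
    return bt_loss(w_bt, h_chosen, h_rejected) + lam * mo_loss(W_mo, h, scores)
```

In this toy form both heads read the same embedding space, so gradients from either loss would update the shared representation in a real model.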
Findings
Joint training improves OOD robustness of reward models
Multi-objective scores enhance alignment accuracy
Framework outperforms larger baseline models
Abstract
Reward models trained on human preference data have demonstrated strong effectiveness in aligning Large Language Models (LLMs) with human intent under the framework of Reinforcement Learning from Human Feedback (RLHF). However, RLHF remains vulnerable to reward hacking, where the policy exploits imperfections in the reward function rather than genuinely learning the intended behavior. Although significant efforts have been made to mitigate reward hacking, they predominantly focus on and evaluate in-distribution scenarios, where the training and testing data for the reward model share the same distribution. In this paper, we empirically show that state-of-the-art methods struggle in more challenging out-of-distribution (OOD) settings. We further demonstrate that incorporating fine-grained multi-attribute scores helps address this challenge. However, the limited availability of…
Peer Reviews
Decision·ICLR 2026 Poster
* The paper is well written and clearly positioned relative to prior work, while thoughtfully incorporating insights from it.
* The method is simple and well-motivated, and the theoretical analysis makes the core idea more compelling.
* The evaluation is thorough, covering both reward modeling and RLHF.
* The design of SMORL-L uses the mean of the multi-objective scores. However, using the mean does not make sense, which means you give all rewards equal weights given all prompts. Actually, this should be utilized in a contextual way and dynamically adapt to the user prompts [1][2].
* The authors claim SMORM is flexible because the two heads are trained on different prompt–response pairs. In practice, this complicates using and coordinating two distinct datasets. I therefore disagree with the f
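To make the aggregation concern in this review concrete, here is a minimal sketch contrasting a uniform mean over attribute scores with a prompt-dependent weighted average. The attribute order, the example scores, and the weights are hypothetical, and the weighting scheme is only one possible reading of the "contextual" aggregation the reviewer suggests, not something taken from the paper.

```python
def mean_aggregate(scores):
    """Uniform average: every attribute counts equally for every prompt."""
    return sum(scores) / len(scores)

def weighted_aggregate(scores, weights):
    """Weighted average with non-negative, prompt-dependent weights.

    How the weights would be produced (for example by a gating network
    conditioned on the prompt) is left open here; that part is an assumption.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * s for w, s in zip(weights, scores)) / total

# Hypothetical attribute scores, e.g. (helpfulness, verbosity, safety).
scores = [0.9, 0.2, 0.4]
uniform = mean_aggregate(scores)
# Hypothetical weights for a prompt where verbosity should be ignored.
contextual = weighted_aggregate(scores, [1.0, 0.0, 1.0])
```

With equal weights the weighted form reduces to the mean, so the mean can be seen as the special case in which the prompt carries no information about attribute importance.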
1. The paper offers a theoretically well-grounded contribution. The link between BT loss and regression loss via Fisher Information analysis is original and potentially influential.
2. The joint single/multi-objective framework is conceptually neat and aligns with the broader goal of multi-dimensional reward alignment.
3. The idea of embedding-space complementarity between preference learning and multi-attribute regression is interesting and could inspire follow-up research.
4. Writing and struc
1. Lack of statistical rigor. The paper reports only mean scores without standard deviations or multiple-seed averages. Given the small performance margins (≈1–2 points on RewardBench or RM-Bench), these gains could easily fall within noise. This omission is especially problematic because the theoretical claim centers on variance reduction, yet no variance statistics are provided.
2. Weak OOD validation and limited generalization coverage. The paper claims robustness “in OOD settings” but evalu
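The kind of reporting this review asks for can be sketched as follows: summarize per-seed benchmark scores by mean and sample standard deviation, then compare the gap between two models against its standard error. The function names and the threshold `k` are assumptions, and any numbers passed in would be placeholders, not results from the paper.

```python
import math
import statistics

def summarize(scores):
    """Return (mean, sample standard deviation) over per-seed scores."""
    return statistics.mean(scores), statistics.stdev(scores)

def gap_exceeds_noise(baseline, candidate, k=2.0):
    """Crude check: is the mean gap larger than k standard errors?

    Uses the unpooled standard error of a difference in means. This is a
    rough heuristic under assumed independence, not a formal test.
    """
    mean_b, std_b = summarize(baseline)
    mean_c, std_c = summarize(candidate)
    se = math.sqrt(std_b ** 2 / len(baseline) + std_c ** 2 / len(candidate))
    return (mean_c - mean_b) > k * se
```

With only a handful of seeds the standard error is itself noisy, so a check like this is best read as a sanity filter on small margins rather than as evidence of significance.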
- **Clear OOD focus with concrete evidence.** The paper explicitly studies PPO/BoN under prompt-distribution shifts and shows baselines (e.g., GRM, ODIN) misspecify or overfit signals that lead to reward hacking, whereas SMORM variants maintain rising gold scores.
- **Principled link between BT and regression.** Lemma 1 upper-bounds expected BT loss by the regression MSE; Theorem 2 uses Fisher information to argue strictly better asymptotic MSE for both heads under joint training.
- **Practi
- **Assumptions (1–3) may be stringent in practice.** Theorem 1 requires (i) **bounded features**, (ii) **positive-definite** covariance matrices for BT and multi-objective features, and (iii) a **positive-correlation** condition via a coupling vector. In realistic learned embeddings, covariances can be **rank-deficient** (e.g., collinearity, limited per-attribute data, heavy regularization), and verifying the correlation condition empirically is non-trivial. Discussion or diagnostics on how oft
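One simple diagnostic for the positive-definiteness assumption raised here is to estimate the feature covariance and attempt a Cholesky factorization, which succeeds only for positive-definite matrices. The sketch below is a stdlib-only illustration; the tolerance `tol` and the function names are assumptions, and it says nothing about how the paper's own embeddings behave.

```python
import math

def covariance(rows):
    """Sample covariance matrix of feature vectors given as lists of floats."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [
        [sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
         for j in range(d)]
        for i in range(d)
    ]

def is_positive_definite(mat, tol=1e-10):
    """Attempt a Cholesky factorization; a pivot at or below `tol` means
    the matrix is (numerically) not positive definite."""
    d = len(mat)
    L = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                pivot = mat[i][i] - s
                if pivot <= tol:
                    return False
                L[i][j] = math.sqrt(pivot)
            else:
                L[i][j] = (mat[i][j] - s) / L[j][j]
    return True
```

Collinear features, one of the failure modes the review mentions, make the covariance rank-deficient, and the factorization then stops at a zero pivot.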
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Recommender Systems and Techniques · Machine Learning and Data Classification
MethodsFocus
