Bradley-Terry and Multi-Objective Reward Modeling Are Complementary

Zhiwei Zhang; Hui Liu; Xiaomin Li; Zhenwei Dai; Jingying Zeng; Fali Wang; Minhua Lin; Ramraj Chandradevan; Zhen Li; Chen Luo; Xianfeng Tang; Qi He; Suhang Wang

arXiv:2507.07375·cs.LG·July 11, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces a unified reward modeling framework that combines Bradley-Terry and multi-objective methods, improving the robustness and performance of reward models used to align language models in out-of-distribution scenarios.

Contribution

The paper proposes a joint training framework for Bradley-Terry and multi-objective reward functions, establishing their theoretical connection and demonstrating improved robustness and scoring in OOD settings.
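As a rough illustration of what jointly training the two objectives can look like, the sketch below adds a Bradley-Terry preference loss and a multi-attribute regression loss computed on the same embedding space. It is a minimal sketch under stated assumptions, not the paper's implementation: the linear heads, the function names (`bt_loss`, `mo_loss`, `joint_loss`), and the weighting factor `lam` are all hypothetical, and the embeddings are plain lists standing in for the output of a shared backbone.

```python
import math


def dot(w, x):
    """Inner product of two equal-length vectors."""
    return sum(wi * xi for wi, xi in zip(w, x))


def bt_loss(w_bt, emb_chosen, emb_rejected):
    """Bradley-Terry negative log-likelihood for one preference pair."""
    margin = dot(w_bt, emb_chosen) - dot(w_bt, emb_rejected)
    # -log(sigmoid(margin)), written in a numerically stable form.
    return math.log1p(math.exp(-abs(margin))) + max(-margin, 0.0)


def mo_loss(W_mo, emb, target_scores):
    """Mean squared error between predicted and labeled attribute scores."""
    preds = [dot(w_k, emb) for w_k in W_mo]
    return sum((p - t) ** 2 for p, t in zip(preds, target_scores)) / len(preds)


def joint_loss(w_bt, W_mo, pref_pair, attr_example, lam=1.0):
    """Sum of the two losses; `lam` is an assumed trade-off weight.

    The preference pair and the attribute-labeled example may come from
    different datasets; both are assumed to be embedded by the same backbone.
    """
    emb_chosen, emb_rejected = pref_pair
    emb, targets = attr_example
    return bt_loss(w_bt, emb_chosen, emb_rejected) + lam * mo_loss(W_mo, emb, targets)
```

In a real model the gradient of `joint_loss` would flow into the shared backbone as well as both heads, which is how each objective can shape the representation the other relies on.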

Findings

01

Joint training improves OOD robustness of reward models

02

Multi-objective scores enhance alignment accuracy

03

Framework outperforms larger baseline models

Abstract

Reward models trained on human preference data have demonstrated strong effectiveness in aligning Large Language Models (LLMs) with human intent under the framework of Reinforcement Learning from Human Feedback (RLHF). However, RLHF remains vulnerable to reward hacking, where the policy exploits imperfections in the reward function rather than genuinely learning the intended behavior. Although significant efforts have been made to mitigate reward hacking, they predominantly focus on and evaluate in-distribution scenarios, where the training and testing data for the reward model share the same distribution. In this paper, we empirically show that state-of-the-art methods struggle in more challenging out-of-distribution (OOD) settings. We further demonstrate that incorporating fine-grained multi-attribute scores helps address this challenge. However, the limited availability of…

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

* The paper is well written and clearly positioned relative to prior work, while thoughtfully incorporating insights from it.
* The method is simple and well-motivated, and the theoretical analysis makes the core idea more compelling.
* The evaluation is thorough, covering both reward modeling and RLHF.

Weaknesses

* The design of SMORL-L uses the mean of the multi-objective scores. However, a plain mean is hard to justify: it gives every reward equal weight regardless of the prompt. The aggregation should instead be contextual and adapt dynamically to the user prompt [1][2].
* The authors claim SMORM is flexible because the two heads are trained on different prompt–response pairs. In practice, this complicates using and coordinating two distinct datasets. I therefore disagree with the f
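To make the first point concrete, the sketch below contrasts a uniform mean over attribute scores with a prompt-dependent weighted sum. It is only an illustration of the distinction the reviewer draws: `gate_logits` is a hypothetical per-prompt vector (for example, the output of some gating network), and nothing here is taken from the paper or from the cited works.

```python
import math


def mean_aggregate(scores):
    """Uniform aggregation: every attribute gets weight 1/K for every prompt."""
    return sum(scores) / len(scores)


def contextual_aggregate(scores, gate_logits):
    """Prompt-dependent aggregation using softmax weights.

    `gate_logits` is assumed to be produced per prompt by some gating
    function; here it is simply passed in as a list of floats.
    """
    m = max(gate_logits)  # subtract the max for numerical stability
    exps = [math.exp(g - m) for g in gate_logits]
    z = sum(exps)
    weights = [e / z for e in exps]
    return sum(w * s for w, s in zip(weights, scores))
```

With all-equal logits the contextual version reduces to the plain mean, so uniform averaging is the special case in which the prompt carries no information about which attributes matter.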

Reviewer 02 · Rating 4 · Confidence 3

Strengths

1. The paper offers a theoretically well-grounded contribution. The link between BT loss and regression loss via Fisher Information analysis is original and potentially influential.
2. The joint single/multi-objective framework is conceptually neat and aligns with the broader goal of multi-dimensional reward alignment.
3. The idea of embedding-space complementarity between preference learning and multi-attribute regression is interesting and could inspire follow-up research.
4. Writing and struc

Weaknesses

1. Lack of statistical rigor. The paper reports only mean scores without standard deviations or multiple-seed averages. Given the small performance margins (≈1–2 points on RewardBench or RM-Bench), these gains could easily fall within noise. This omission is especially problematic because the theoretical claim centers on variance reduction, yet no variance statistics are provided.
2. Weak OOD validation and limited generalization coverage. The paper claims robustness “in OOD settings” but evalu
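The kind of reporting the first point asks for can be sketched as follows: summarize each method's benchmark score across seeds by mean and sample standard deviation, then check whether the gap between two methods is large relative to that spread. The helper names and the threshold `k` are assumptions made for illustration, and the comparison is a crude heuristic, not a formal significance test.

```python
import math
import statistics


def summarize(scores_by_seed):
    """Return (mean, sample standard deviation) over per-seed scores."""
    return statistics.mean(scores_by_seed), statistics.stdev(scores_by_seed)


def gain_exceeds_noise(baseline_scores, candidate_scores, k=2.0):
    """Heuristic check: is the mean gain larger than k pooled std devs?

    `k` is an arbitrary illustrative threshold, not a calibrated test level.
    """
    b_mean, b_std = summarize(baseline_scores)
    c_mean, c_std = summarize(candidate_scores)
    pooled_std = math.sqrt((b_std ** 2 + c_std ** 2) / 2)
    return (c_mean - b_mean) > k * pooled_std
```

Each input is a list with one score per random seed, and at least two seeds per method are needed for the sample standard deviation to be defined.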

Reviewer 03 · Rating 6 · Confidence 2

Strengths

- **Clear OOD focus with concrete evidence.** The paper explicitly studies PPO/BoN under prompt-distribution shifts and shows baselines (e.g., GRM, ODIN) misspecify or overfit signals that lead to reward hacking, whereas SMORM variants maintain rising gold scores.
- **Principled link between BT and regression.** Lemma 1 upper-bounds expected BT loss by the regression MSE; Theorem 2 uses Fisher information to argue strictly better asymptotic MSE for both heads under joint training.
- **Practi

Weaknesses

- **Assumptions (1–3) may be stringent in practice.** Theorem 1 requires (i) **bounded features**, (ii) **positive-definite** covariance matrices for BT and multi-objective features, and (iii) a **positive-correlation** condition via a coupling vector. In realistic learned embeddings, covariances can be **rank-deficient** (e.g., collinearity, limited per-attribute data, heavy regularization), and verifying the correlation condition empirically is non-trivial. Discussion or diagnostics on how oft

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Recommender Systems and Techniques · Machine Learning and Data Classification

MethodsFocus