Two Minds Better Than One: Collaborative Reward Modeling for LLM Alignment

Jiazheng Zhang; Wenqing Jing; Zizhuo Zhang; Zhiheng Xi; Shihan Dou; Rongxiang Weng; Jiahuan Li; Jingang Wang; Mingxu Chai; Shibo Hong; Tao Gui; Qi Zhang

arXiv:2505.10597·cs.LG·May 20, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces a collaborative reward modeling framework that improves the robustness of reward models for LLM alignment by filtering noisy preferences through peer review and curriculum learning, significantly enhancing generalization.

Contribution

It proposes an online collaborative reward modeling approach with peer review and curriculum learning to effectively filter noisy preferences and improve reward model robustness.
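The peer-review idea can be illustrated with a minimal sketch. It assumes only what this page states: two reward models, each selecting low-loss preference pairs for the other. Everything else is an assumption made for illustration and not the authors' implementation: the Bradley-Terry style pairwise loss, the function names `pairwise_loss`, `select_low_loss`, and `peer_review_step`, the fixed `keep_ratio`, and the toy reward scores. The curriculum-learning component is not sketched.

```python
import math

def pairwise_loss(r_chosen, r_rejected):
    """Assumed loss: -log(sigmoid(r_chosen - r_rejected)), computed stably."""
    margin = r_chosen - r_rejected
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

def select_low_loss(losses, keep_ratio):
    """Indices of the keep_ratio fraction of pairs with the smallest loss."""
    n_keep = max(1, int(round(keep_ratio * len(losses))))
    order = sorted(range(len(losses)), key=lambda i: losses[i])
    return sorted(order[:n_keep])

def peer_review_step(scores_a, scores_b, keep_ratio):
    """Each model picks its low-loss pairs; the *peer* trains on that subset.

    scores_a / scores_b: lists of (reward_chosen, reward_rejected) tuples
    produced by model A and model B for the same batch of preference pairs.
    """
    losses_a = [pairwise_loss(c, r) for c, r in scores_a]
    losses_b = [pairwise_loss(c, r) for c, r in scores_b]
    train_b_on = select_low_loss(losses_a, keep_ratio)  # A reviews for B
    train_a_on = select_low_loss(losses_b, keep_ratio)  # B reviews for A
    return train_a_on, train_b_on

# Toy batch of four pairs (hypothetical scores); pair 1 looks mislabeled
# to both models because the "rejected" response scores higher.
scores_a = [(2.0, 0.0), (0.0, 2.0), (1.0, 0.5), (0.2, 0.1)]
scores_b = [(1.5, 0.0), (0.1, 1.0), (0.3, 0.4), (1.0, 0.0)]
train_a_on, train_b_on = peer_review_step(scores_a, scores_b, keep_ratio=0.75)
print(train_a_on, train_b_on)
```

In this toy batch both models assign the highest loss to pair 1, so neither passes it to its peer. In a real training loop the selected indices would determine which pairs contribute to each model's gradient update.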

Findings

01

CRM improves reward model generalization by up to 9.94 points on RewardBench.

02

Filtering noisy preferences enhances LLM alignment performance.

03

CRM extends to implicit-reward alignment methods.

Abstract

Reward models (RMs) play a pivotal role in aligning large language models (LLMs) with human values. However, noisy preferences in human feedback can lead to reward misgeneralization, a phenomenon where reward models learn spurious correlations or overfit to noisy preferences, which poses important challenges to the generalization of RMs. This paper systematically analyzes the characteristics of preference pairs and aims to identify how noisy preferences differ from human-aligned preferences in reward modeling. Our analysis reveals that noisy preferences are difficult for RMs to fit, as they cause sharp training fluctuations and irregular gradient updates. These distinctive dynamics suggest the feasibility of identifying and excluding such noisy preferences. Empirical studies demonstrate that a policy LLM optimized with a reward model trained on the full preference dataset, which includes…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 4

Strengths

1. The selected problem setting, i.e., noisy preference data in RLHF, is interesting and is known to impact downstream alignment performance. The paper demonstrates this by defining robust, incorrect, and ambiguous preference pairs, though the results in Figures 2 and 3 may be inaccurate because of the "self-loss". 2. The proposed CRM framework is very simple and easy to implement: two reward models select low-loss pairs for each other

Weaknesses

1. The proposed CRM is limited in theory, and its design choices, including the self-loss and the data filtering strategy, are mostly heuristic, which suggests it may not be effective or extendable in other settings. For example, why co-evolve two models rather than three or more? 2. The empirical setting may have a fatal flaw: the noise in the data is manually added by symmetric label flipping, which cannot reflect the complexity of real-world alignment. The used "self-loss" for disting

Reviewer 02 · Rating 6 · Confidence 3

Strengths

The paper is well written and addresses an important problem. The methodology is novel to the best of my knowledge. Several detailed experiments are provided.

Weaknesses

I have one primary issue with the experimental results. In CRM, one trains twice the number of reward-model parameters, so comparing it with methods that use a single reward model does not seem fair. For an effective demonstration, the authors should compare the results with a 6B reward model. Otherwise, I am not convinced that the performance gain is not simply due to the increased number of parameters being trained. If the authors can elaborate o

Reviewer 03 · Rating 4 · Confidence 3

Strengths

1. The paper systematically analyzes the intrinsic characteristics of preference pairs and proposes an online noise filtering method rather than merely improving training objectives based on this analysis. 2. The method demonstrates strong generalizability, being applicable to both explicit and implicit reward alignment approaches, and the paper is overall clear, coherent, and well-written.

Weaknesses

1. In the Peer Review stage, two reward models are trained collaboratively; however, this dual-model setup approximately doubles the training cost. The paper does not analyze training efficiency, convergence speed, or scalability with respect to data size, and such analysis is recommended. 2. In Section 3.2, the paper states that the two reward models determine their sample selection ratio based on the noise rate to mutually update and improve performance, but it does not explain how the noise r

Taxonomy

Topics: Recommender Systems and Techniques · Topic Modeling · Explainable Artificial Intelligence (XAI)