Variance-aware Reward Modeling with Anchor Guidance
Shuxing Fang, Ruijian Han, Liangyu Zhang, Fan Zhou

TL;DR
This paper introduces Anchor-guided Variance-aware Reward Modeling, which enhances reward models by incorporating coarse anchor labels to resolve non-identifiability, improving performance in RLHF tasks.
Contribution
The paper proposes a framework that uses two coarse response-level anchor labels to make the reward variance identifiable, with theoretical guarantees and improved empirical results over existing models.
Findings
Improves reward modeling performance on four real-world diverging-preference datasets.
Enhances downstream RLHF tasks like PPO training and best-of-N selection.
Proves that two anchors are sufficient for identification.
Abstract
Standard Bradley-Terry (BT) reward models are limited when human preferences are pluralistic. Although soft preference labels preserve disagreement information, BT can only express it by shrinking reward margins. Gaussian reward models provide an alternative by jointly predicting a reward mean and a reward variance, but they are fundamentally non-identifiable from pairwise preferences alone. We propose Anchor-guided Variance-aware Reward Modeling, a framework that resolves this non-identifiability by augmenting preference data with two coarse response-level anchor labels. Building on this, we prove that two anchors are sufficient for identification, develop a joint training objective, and establish a non-asymptotic convergence rate for both the estimated reward mean and variance functions. Across simulation studies and four real-world diverging-preference datasets, our method…
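The non-identifiability mentioned in the abstract can be illustrated with a small numerical sketch. It assumes one common Gaussian formulation, in which each response has an independent Gaussian reward and the preference probability is the chance that one draw exceeds the other. This summary does not give the paper's exact model, so the function name pref_prob, the formula, and all numbers below are illustrative assumptions rather than the authors' method. Under this assumption, rescaling every mean by c and every variance by c^2 leaves all pairwise preference probabilities unchanged, so pairwise data alone cannot pin down the scale of the variance.

```python
import math


def pref_prob(mu_a, var_a, mu_b, var_b):
    """P(response a is preferred over b), assuming independent Gaussian rewards.

    Hypothetical helper: r_a ~ N(mu_a, var_a), r_b ~ N(mu_b, var_b), and
    a is preferred when r_a > r_b, which gives a standard normal CDF of
    the mean gap divided by the combined standard deviation.
    """
    z = (mu_a - mu_b) / math.sqrt(var_a + var_b)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Illustrative (made-up) reward means and variances for two responses.
mu_a, var_a = 1.0, 0.5
mu_b, var_b = 0.2, 1.5

p_original = pref_prob(mu_a, var_a, mu_b, var_b)

# Rescale means by c and variances by c**2: a different reward model
# that implies the same pairwise preference probability.
c = 3.0
p_rescaled = pref_prob(c * mu_a, c**2 * var_a, c * mu_b, c**2 * var_b)

print(p_original, p_rescaled)
```

The two printed probabilities agree up to floating-point error even though the second model has nine times the variance. Extra response-level information, such as the anchor labels the abstract describes, is one way to break this kind of tie.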
