Variance-aware Reward Modeling with Anchor Guidance

Shuxing Fang; Ruijian Han; Liangyu Zhang; Fan Zhou

arXiv:2605.11865·stat.ML·May 13, 2026

TL;DR

This paper introduces Anchor-guided Variance-aware Reward Modeling, which augments pairwise preference data with two coarse response-level anchor labels to resolve the non-identifiability of Gaussian reward models, improving performance on downstream RLHF tasks.

Contribution

It proposes a framework that uses anchor labels to identify the reward variance, backed by an identification proof and a non-asymptotic convergence rate, and reports improved empirical results over existing reward models.

Findings

01. Improves reward modeling performance across simulation studies and four real-world diverging-preference datasets.

02. Improves downstream RLHF tasks such as PPO training and best-of-N selection.

03. Proves that two anchor labels are sufficient for identification.

Abstract

Standard Bradley-Terry (BT) reward models are limited when human preferences are pluralistic. Although soft preference labels preserve disagreement information, BT can only express it by shrinking reward margins. Gaussian reward models provide an alternative by jointly predicting a reward mean and a reward variance, but suffer from a fundamental non-identifiability from pairwise preferences alone. We propose Anchor-guided Variance-aware Reward Modeling, a framework that resolves this non-identifiability by augmenting preference data with two coarse response-level anchor labels. Building on this, we prove that two anchors are sufficient for identification, develop a joint training objective, and establish a non-asymptotic convergence rate for both the estimated reward mean and variance functions. Across simulation studies and four real-world diverging-preference datasets, our method…
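The two limitations named in the abstract can be illustrated with a small numerical sketch. It assumes a Thurstone-style Gaussian model in which the two responses have independent Gaussian rewards, so the preference probability is a normal CDF of the mean gap divided by the pooled standard deviation. The abstract does not give the paper's parameterization, so this form and all function names below are hypothetical and may differ from the authors' model.

```python
import math


def normal_cdf(x: float) -> float:
    """Standard normal CDF, written with the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def bt_pref(r_a: float, r_b: float) -> float:
    """Bradley-Terry probability that response a is preferred over b."""
    return 1.0 / (1.0 + math.exp(-(r_a - r_b)))


def bt_margin_for_soft_label(p: float) -> float:
    """Reward margin r_a - r_b that BT needs to reproduce a soft label p."""
    return math.log(p / (1.0 - p))


def gaussian_pref(mu_a: float, var_a: float, mu_b: float, var_b: float) -> float:
    """Assumed Gaussian model: P(a > b) with independent N(mu, var) rewards."""
    return normal_cdf((mu_a - mu_b) / math.sqrt(var_a + var_b))


def rescale(mu: float, var: float, c: float) -> tuple[float, float]:
    """Multiply the mean by c and the variance by c**2 (hypothetical helper)."""
    return c * mu, c * c * var


# BT: a soft label close to 0.5 can only be fit by a small margin.
margin_split = bt_margin_for_soft_label(0.6)   # annotators disagree
margin_clear = bt_margin_for_soft_label(0.95)  # annotators mostly agree

# Gaussian: rescaling every mean and variance leaves P(a > b) unchanged,
# so pairwise preferences alone cannot pin down the scale of the variance.
p_original = gaussian_pref(1.0, 0.5, 0.2, 1.5)
mu_a, var_a = rescale(1.0, 0.5, 3.0)
mu_b, var_b = rescale(0.2, 1.5, 3.0)
p_rescaled = gaussian_pref(mu_a, var_a, mu_b, var_b)
```

Under this assumed model, `p_original` and `p_rescaled` agree up to floating-point error, which is one concrete form of the non-identifiability the abstract describes. How the two anchor labels remove it is specific to the paper and is not reproduced here.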
