Towards Reliable, Uncertainty-Aware Alignment

Debangshu Banerjee; Kintan Saha; Aditya Gopalan

arXiv:2507.15906·cs.LG·July 23, 2025

TL;DR

This paper investigates the instability in reward model training for LLM alignment, demonstrating how reward model variability can cause overfitting, and proposes a variance-aware policy optimization method to improve stability and robustness.

Contribution

The paper introduces a variance-aware policy optimization framework that incorporates reward model uncertainty to enhance alignment stability.

Findings

01. Reward models trained on the same data can significantly disagree.
02. Variability in reward estimation can lead to overfitting and performance degradation.
03. Variance-aware optimization improves alignment robustness and stability.
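
The findings above can be made concrete with a small sketch. It is not the paper's method: the mean-minus-standard-deviation penalty, the function names, and the scores are all hypothetical, chosen only to show how disagreement among reward models trained on the same data might change which response looks best.

```python
from statistics import mean, pstdev


def ensemble_stats(scores):
    """Return (mean, population std) of the scores that several
    reward models assign to a single response."""
    return mean(scores), pstdev(scores)


def variance_penalized_reward(scores, penalty=1.0):
    """Illustrative pessimistic aggregate: mean minus penalty * std.

    This is an assumed stand-in for a variance-aware objective,
    not the formulation used in the paper.
    """
    avg, std = ensemble_stats(scores)
    return avg - penalty * std


# Hypothetical scores from four independently trained reward models.
candidates = {
    "response_a": [2.0, 2.0, 2.0, 2.0],  # models agree
    "response_b": [5.0, 1.0, 4.0, 0.0],  # models disagree strongly
}

# Ranking by the plain ensemble mean ignores disagreement.
naive_choice = max(candidates, key=lambda r: mean(candidates[r]))

# Ranking by the penalized score discounts high-variance responses.
cautious_choice = max(
    candidates, key=lambda r: variance_penalized_reward(candidates[r])
)

print(naive_choice, cautious_choice)
```

In this toy setup the plain mean prefers the response on which the models disagree, while the penalized score prefers the one they agree on. The `penalty` weight is a free parameter here; how uncertainty should be weighted in practice is left to the paper.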

Abstract

Alignment of large language models (LLMs) typically involves training a reward model on preference data, followed by policy optimization with respect to the reward model. However, optimizing policies with respect to a single reward model estimate can render them vulnerable to inaccuracies in the reward model. We empirically study the variability of reward model training on open-source benchmarks. We observe that reward models trained independently on the same preference dataset can exhibit substantial disagreement, highlighting the instability of current alignment strategies. Employing a theoretical model, we demonstrate that variability in reward model estimation can cause overfitting, leading to the risk of performance degradation. To mitigate this risk, we propose a variance-aware policy optimization framework for preference-based alignment. The key ingredient of the framework is a new…

Taxonomy

Topics: Advanced Database Systems and Queries · Semantic Web and Ontologies · Business Process Modeling and Analysis