Towards Reliable, Uncertainty-Aware Alignment
Debangshu Banerjee, Kintan Saha, Aditya Gopalan

TL;DR
This paper investigates the instability in reward model training for LLM alignment, demonstrating how reward model variability can cause overfitting, and proposes a variance-aware policy optimization method to improve stability and robustness.
Contribution
The paper introduces a variance-aware policy optimization framework that incorporates reward model uncertainty to enhance alignment stability.
Findings
Reward models trained independently on the same preference dataset can disagree substantially.
Variability in reward estimation can lead to overfitting and performance degradation.
Variance-aware optimization improves alignment robustness and stability.
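The findings above can be illustrated with a small sketch. It is not the paper's method, whose key ingredient is not stated in this summary. It assumes an ensemble of independently trained reward models, measures their disagreement as a standard deviation, and subtracts that disagreement from the mean reward. The function names, the `penalty` parameter, and the reward values are hypothetical and chosen only for illustration.

```python
from statistics import mean, pstdev


def ensemble_stats(scores):
    """Return (mean, population std) of the reward estimates for one response.

    `scores` holds one estimate per independently trained reward model.
    """
    return mean(scores), pstdev(scores)


def variance_penalized_reward(scores, penalty=1.0):
    """Mean reward minus `penalty` times the ensemble std.

    A generic pessimistic score used here for illustration only;
    it is an assumption, not the estimator proposed in the paper.
    """
    mu, sigma = ensemble_stats(scores)
    return mu - penalty * sigma


def pick_response(candidates, penalty=1.0):
    """Pick the candidate whose variance-penalized reward is largest.

    `candidates` maps a response name to its list of reward estimates.
    """
    return max(
        candidates,
        key=lambda name: variance_penalized_reward(candidates[name], penalty),
    )


# Hypothetical reward estimates from four reward models.
candidates = {
    "response_a": [0.9, 0.1, 1.4, -0.2],  # higher mean, models disagree
    "response_b": [0.5, 0.5, 0.4, 0.6],   # slightly lower mean, models agree
}

for name, scores in candidates.items():
    mu, sigma = ensemble_stats(scores)
    print(f"{name}: mean={mu:.3f} std={sigma:.3f}")

print("penalty=0.0 picks", pick_response(candidates, penalty=0.0))
print("penalty=1.0 picks", pick_response(candidates, penalty=1.0))
```

With no penalty the selection follows the mean alone and favors the response the models disagree on. With the penalty, the choice shifts to the response whose reward estimate is more consistent across models.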
Abstract
Alignment of large language models (LLMs) typically involves training a reward model on preference data, followed by policy optimization with respect to the reward model. However, optimizing policies with respect to a single reward model estimate can render it vulnerable to inaccuracies in the reward model. We empirically study the variability of reward model training on open-source benchmarks. We observe that independently trained reward models on the same preference dataset can exhibit substantial disagreement, highlighting the instability of current alignment strategies. Employing a theoretical model, we demonstrate that variability in reward model estimation can cause overfitting, leading to the risk of performance degradation. To mitigate this risk, we propose a variance-aware policy optimization framework for preference-based alignment. The key ingredient of the framework is a new…
Taxonomy
Topics: Advanced Database Systems and Queries · Semantic Web and Ontologies · Business Process Modeling and Analysis
