Towards Reliable Alignment: Uncertainty-aware RLHF
Debangshu Banerjee, Aditya Gopalan

TL;DR
This paper shows that reward models used in RLHF vary substantially from one training run to the next, and that policies optimized against a single such model inherit that risk. It proposes an uncertainty-aware, conservative policy optimization method, supported by theoretical analysis and experiments.
Contribution
It proposes a novel uncertainty-aware, conservative algorithm for RLHF that reduces risk and overfitting caused by reward model fluctuations, validated through theoretical analysis and experiments.
Findings
Reward model fluctuations can cause overfitting and risk in policy training.
The proposed method reduces risk compared to vanilla approaches.
Empirical results confirm theoretical risk reduction with ensemble reward models.
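To make the idea of reward model fluctuation concrete, the following is a minimal sketch of how disagreement within an ensemble could be measured. It is an illustration only, not the authors' code: the function name `ensemble_stats` and the toy scores are hypothetical, and the population standard deviation is just one possible disagreement measure.

```python
from statistics import mean, pstdev


def ensemble_stats(scores_by_model):
    """Return (mean, std) of the ensemble's scores for each response.

    scores_by_model[m][i] is the score that reward model m assigns to
    response i. All models are assumed to score the same responses.
    """
    # Transpose so each tuple holds every model's score for one response.
    per_response = zip(*scores_by_model)
    return [(mean(s), pstdev(s)) for s in per_response]


# Toy data (made up): three reward models scoring two responses.
scores = [
    [1.0, 0.2],   # model 0
    [1.2, 0.9],   # model 1
    [0.8, -0.5],  # model 2
]
stats = ensemble_stats(scores)
for i, (mu, sigma) in enumerate(stats):
    print(f"response {i}: mean={mu:.3f} std={sigma:.3f}")
```

A large standard deviation for a response signals that the reward estimate for it depends heavily on which model in the ensemble happens to be used.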
Abstract
Recent advances in aligning Large Language Models with human preferences have benefited from larger reward models and better preference data. However, most of these methodologies rely on the accuracy of the reward model. The reward models used in Reinforcement Learning with Human Feedback (RLHF) are typically learned from small datasets using stochastic optimization algorithms, making them prone to high variability. We illustrate the inconsistencies between reward models empirically on numerous open-source datasets. We theoretically show that the fluctuation of the reward models can be detrimental to the alignment problem because the derived policies are more overfitted to the reward model and, hence, are riskier if the reward model itself is uncertain. We use concentration of measure to motivate an uncertainty-aware, conservative algorithm for policy optimization. We show that such…
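One generic way to act conservatively under reward uncertainty is to penalize the ensemble mean by the ensemble spread, in the style of a lower confidence bound. The sketch below assumes that form purely for illustration. It is not taken from the paper, whose exact objective is not given in the text above, and the names `conservative_reward` and `beta` and the candidate scores are hypothetical.

```python
from statistics import mean, pstdev


def conservative_reward(ensemble_scores, beta=1.0):
    """Mean ensemble score minus beta times the ensemble std.

    beta is a hypothetical knob controlling how strongly disagreement
    between reward models is penalized; beta=0 recovers the plain mean.
    """
    return mean(ensemble_scores) - beta * pstdev(ensemble_scores)


# Toy data (made up): scores from three reward models per candidate.
candidates = {
    "A": [2.0, 0.1, 3.9],  # higher mean, models disagree strongly
    "B": [1.5, 1.6, 1.4],  # slightly lower mean, models agree
}

# Ranking by the plain mean ignores disagreement between models.
naive_choice = max(candidates, key=lambda k: mean(candidates[k]))
# Ranking by the penalized score prefers the consistently rated candidate.
safe_choice = max(candidates, key=lambda k: conservative_reward(candidates[k]))

print(naive_choice, safe_choice)
```

With these toy numbers the plain mean prefers the candidate the models disagree on, while the penalized score prefers the one they rate consistently.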
Taxonomy
Topics: Formal Methods in Verification · AI-based Problem Solving and Planning · Simulation Techniques and Applications
Methods: ALIGN
