On The Global Convergence Of Online RLHF With Neural Parametrization
Mudit Gaur, Amrit Singh Bedi, Raghu Pasupathy, Vaneet Aggarwal

TL;DR
This paper establishes the first theoretical convergence guarantees for online RLHF with neural network parametrization, addressing distribution shift issues and proposing a bi-level optimization approach with proven efficiency.
Contribution
It introduces a bi-level formulation for neural RLHF, proposes a first-order solution method, and provides the first convergence rate bounds in this setting.
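As a rough illustration of what a bi-level view of RLHF can look like, here is a generic sketch under assumed notation; it is not necessarily the formulation used in the paper. Let $r_\phi$ be a neural reward model, $\pi_\theta$ a neural policy, $\pi_{\mathrm{SFT}}$ the supervised fine-tuned reference policy, $\rho$ a prompt distribution, $\sigma$ the logistic function, and $\beta > 0$ a KL-regularization weight. Response pairs are drawn from the current policy and labeled by a preference annotator as preferred ($y_w$) and dispreferred ($y_l$). All of these symbols are assumptions introduced here for illustration.

```latex
% Illustrative bi-level RLHF objective (assumed notation, not taken from the paper)
\begin{aligned}
\min_{\phi}\quad & -\,\mathbb{E}_{x \sim \rho,\; (y_w, y_l) \sim \pi_{\theta^*(\phi)}(\cdot \mid x)}
  \Big[ \log \sigma\big( r_\phi(x, y_w) - r_\phi(x, y_l) \big) \Big] \\
\text{s.t.}\quad & \theta^*(\phi) \in \arg\max_{\theta}\;
  \mathbb{E}_{x \sim \rho,\; y \sim \pi_\theta(\cdot \mid x)} \big[ r_\phi(x, y) \big]
  - \beta\, \mathbb{E}_{x \sim \rho}\Big[ \mathrm{KL}\big( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{SFT}}(\cdot \mid x) \big) \Big]
\end{aligned}
```

In this sketch the upper level fits the reward model on preference data whose distribution depends on the lower-level policy, which is one way to express the interdependence between reward learning and policy learning that the abstract describes.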
Findings
Proposed a bi-level formulation for neural RLHF.
Developed a first-order algorithm with convergence guarantees.
Achieved state-of-the-art sample complexity bounds.
Abstract
The importance of Reinforcement Learning from Human Feedback (RLHF) in aligning large language models (LLMs) with human values cannot be overstated. RLHF is a three-stage process that includes supervised fine-tuning (SFT), reward learning, and policy learning. Although there are several offline and online approaches to aligning LLMs, they often suffer from distribution shift issues. These issues arise from the inability to accurately capture the distributional interdependence between the reward learning and policy learning stages. Consequently, this has led to various approximated approaches, but the theoretical insights and motivations remain largely limited to tabular settings, which do not hold in practice. This gap between theoretical insights and practical implementations is critical. It is challenging to address this gap as it requires analyzing the performance of AI alignment…
Taxonomy
Topics: Advanced Data Compression Techniques · Advanced Adaptive Filtering Techniques
