$f$-Divergence Regularized RLHF: Two Tales of Sampling and Unified Analyses
Di Wu, Chengshuai Shi, Jing Yang, Cong Shen

TL;DR
This paper develops a unified theoretical framework for reinforcement learning from human feedback using general $f$-divergences, proposing two algorithms with proven efficiency and performance bounds.
Contribution
It introduces a holistic analysis of $f$-divergence regularization in RLHF and presents two novel algorithms with theoretical guarantees.
Findings
Achieves $O(\log T)$ regret bounds.
Establishes $O(1/T)$ sub-optimality gap.
First to provide performance bounds for general $f$-divergence regularized online RLHF.
Abstract
Reinforcement Learning from Human Feedback (RLHF) has become a cornerstone technique for post-training large language models. While most existing approaches rely on the reverse KL-regularization, recent empirical studies have begun exploring alternative divergences (e.g., forward KL, chi-squared) as regularizers in RLHF. However, a unified theoretical understanding of general $f$-divergence regularization remains under-explored. To fill this gap, this work develops a comprehensive theoretical framework for online RLHF with a general $f$-divergence regularized objective. Rather than treating each possible divergence function individually, we adopt a holistic perspective across the entire function class and propose two algorithms based on distinct sampling principles. The first extends the classical optimism principle with a carefully designed exploration bonus, while the second…
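For readers unfamiliar with the function class the abstract refers to, the following is a minimal sketch of the standard textbook definition of an $f$-divergence on a finite set, $D_f(p \| q) = \sum_x q(x)\, f(p(x)/q(x))$ with $f$ convex and $f(1) = 0$. It is not taken from the paper: the function names, the toy distributions `pi` and `pi_ref`, and the labeling of $f(t) = t \log t$ as "reverse KL" and $f(t) = -\log t$ as "forward KL" (a convention common in the RLHF literature, with $t = \pi/\pi_{\text{ref}}$) are assumptions made for illustration.

```python
import math

def f_divergence(p, q, f):
    """D_f(p || q) = sum_x q(x) * f(p(x) / q(x)); assumes q(x) > 0 for all x."""
    return sum(qx * f(px / qx) for px, qx in zip(p, q))

# Generator functions f, each convex with f(1) = 0.
def f_reverse_kl(t):
    # Gives KL(p || q); uses the convention 0 * log 0 = 0.
    return t * math.log(t) if t > 0 else 0.0

def f_forward_kl(t):
    # Gives KL(q || p); requires t > 0.
    return -math.log(t)

def f_chi_squared(t):
    # Gives the chi-squared divergence sum_x (p(x) - q(x))^2 / q(x).
    return (t - 1.0) ** 2

# Hypothetical toy policy and reference policy over three responses.
pi = [0.5, 0.3, 0.2]
pi_ref = [0.4, 0.4, 0.2]

for name, f in [("reverse KL", f_reverse_kl),
                ("forward KL", f_forward_kl),
                ("chi-squared", f_chi_squared)]:
    print(name, f_divergence(pi, pi_ref, f))
```

Each divergence is zero when the two distributions coincide and non-negative otherwise, which is what lets a single generator $f$ play the role of the regularizer in the objective.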
