$f$-Divergence Regularized RLHF: Two Tales of Sampling and Unified Analyses

Di Wu; Chengshuai Shi; Jing Yang; Cong Shen

arXiv:2605.06977·cs.LG·May 11, 2026

TL;DR

This paper develops a unified theoretical framework for online reinforcement learning from human feedback (RLHF) regularized by a general $f$-divergence, and proposes two algorithms, built on distinct sampling principles, with proven $O(\log T)$ regret and $O(1/T)$ sub-optimality-gap bounds.

Contribution

It introduces a holistic analysis of $f$-divergence regularization in RLHF and presents two novel algorithms with theoretical guarantees.

Findings

1. Achieves $O(\log T)$ regret bounds.
2. Establishes $O(1/T)$ sub-optimality gap.
3. First to provide performance bounds for general $f$-divergence regularized online RLHF.

Abstract

Reinforcement Learning from Human Feedback (RLHF) has become a cornerstone technique for post-training large language models. While most existing approaches rely on reverse KL regularization, recent empirical studies have begun exploring alternative divergences (e.g., forward KL, chi-squared) as regularizers in RLHF. However, a unified theoretical understanding of general $f$-divergence regularization remains under-explored. To fill this gap, this work develops a comprehensive theoretical framework for online RLHF with a general $f$-divergence regularized objective. Rather than treating each possible divergence function individually, we adopt a holistic perspective across the entire function class and propose two algorithms based on distinct sampling principles. The first extends the classical optimism principle with a carefully designed exploration bonus, while the second…
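
To make the divergences named in the abstract concrete, the Python sketch below evaluates a regularized objective of the assumed form $J(\pi) = \mathbb{E}_{y \sim \pi}[r(y)] - \beta\, D_f(\pi \,\|\, \pi_{\mathrm{ref}})$ on a toy distribution over three responses. The abstract does not state the objective explicitly, so this form, the names `GENERATORS`, `f_divergence`, and `regularized_objective`, and all numbers are illustrative assumptions rather than the paper's definitions; only the generator functions $f$ are the standard textbook ones.

```python
import math

# Standard generator functions f with f(1) = 0.
# Here t is the likelihood ratio pi(y) / pi_ref(y).
GENERATORS = {
    "reverse_kl": lambda t: t * math.log(t),   # gives KL(pi || pi_ref)
    "forward_kl": lambda t: -math.log(t),      # gives KL(pi_ref || pi)
    "chi_squared": lambda t: (t - 1.0) ** 2,   # Pearson chi-squared
}


def f_divergence(pi, pi_ref, f):
    """D_f(pi || pi_ref) = sum_y pi_ref(y) * f(pi(y) / pi_ref(y)).

    Assumption: both distributions are strictly positive on every response.
    """
    return sum(q * f(p / q) for p, q in zip(pi, pi_ref))


def regularized_objective(pi, pi_ref, reward, beta, f):
    """Expected reward minus beta times the f-divergence (assumed form)."""
    expected_reward = sum(p * r for p, r in zip(pi, reward))
    return expected_reward - beta * f_divergence(pi, pi_ref, f)


# Hypothetical toy problem: three candidate responses.
pi_ref = [0.5, 0.3, 0.2]   # reference policy
pi = [0.6, 0.3, 0.1]       # candidate policy
reward = [1.0, 0.5, 0.0]   # reward of each response
beta = 0.1                 # regularization strength

values = {
    name: regularized_objective(pi, pi_ref, reward, beta, f)
    for name, f in GENERATORS.items()
}

for name, value in values.items():
    print(f"{name}: {value:.6f}")
```

The same candidate policy receives a different penalty under each generator, which is why an analysis over the whole function class, rather than one divergence at a time, is a natural target.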
