Thompson Sampling in Online RLHF with General Function Approximation
Songtao Feng, Jie Fu

TL;DR
This paper introduces a Thompson sampling-based algorithm for online reinforcement learning from human feedback with function approximation, providing theoretical regret guarantees and new concentration inequalities.
Contribution
It develops a model-free posterior sampling method for online RLHF and proves regret bounds for it in terms of the Bellman eluder dimension of the function class.
Findings
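Achieves an $O(\sqrt{T})$ regret bound for the proposed algorithm, with multiplicative factors depending on the horizon, the BE dimension, and the bracketing number of the function class.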
Establishes a new concentration inequality for squared Bellman error based on MLE.
Provides theoretical analysis linking function class complexity to regret in RLHF.
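As a short worked reading of the $O(\sqrt{T})$ rate, assuming the notation $\mathrm{Reg}(T)$ for cumulative regret over $T$ rounds and $C$ for the problem-dependent multiplicative factor (both symbols are introduced here for illustration and are not taken from the paper), a sublinear bound implies that the average per-round regret vanishes:

$$
\mathrm{Reg}(T) \le C\sqrt{T}
\quad\Longrightarrow\quad
\frac{\mathrm{Reg}(T)}{T} \le \frac{C}{\sqrt{T}} \xrightarrow[T \to \infty]{} 0 .
$$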
Abstract
Reinforcement learning from human feedback (RLHF) has achieved great empirical success in aligning large language models (LLMs) with human preference, and it is of great importance to study the statistical efficiency of RLHF algorithms from a theoretical perspective. In this work, we consider the online RLHF setting, where the preference data is revealed during the learning process, and study action value function approximation. We design a model-free posterior sampling algorithm for online RLHF inspired by Thompson sampling and provide its theoretical guarantee. Specifically, we adopt the Bellman eluder (BE) dimension as the complexity measure of the function class and establish an $O(\sqrt{T})$ regret bound for the proposed algorithm, with additional multiplicative factors depending on the horizon, the BE dimension, and the bracketing number of the function class. Further, in the analysis, we first…
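To make the idea of posterior sampling with preference feedback concrete, here is a minimal toy sketch. It is not the paper's algorithm: it assumes a single-step (bandit) problem, a small finite hypothesis class of reward vectors, and a Bradley-Terry preference model for the simulated human, none of which are stated in the abstract. The names `run_preference_ts`, `sample_index`, and `hypotheses` are hypothetical.

```python
import itertools
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sample_index(log_weights, rng):
    """Draw an index with probability proportional to exp(log_weights[i])."""
    m = max(log_weights)  # subtract the max for numerical stability
    weights = [math.exp(lw - m) for lw in log_weights]
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


def run_preference_ts(true_rewards, hypotheses, rounds, seed=0):
    """Toy Thompson sampling from pairwise preferences (illustrative only).

    Assumptions: finite hypothesis class, uniform prior, and a
    Bradley-Terry preference model P(a beats b) = sigmoid(r[a] - r[b]).
    """
    rng = random.Random(seed)
    log_post = [0.0] * len(hypotheses)  # unnormalized log-posterior
    best = max(true_rewards)
    regret = 0.0
    for _ in range(rounds):
        # Two independent posterior draws, each acted on greedily.
        h1 = hypotheses[sample_index(log_post, rng)]
        h2 = hypotheses[sample_index(log_post, rng)]
        a = max(range(len(h1)), key=h1.__getitem__)
        b = max(range(len(h2)), key=h2.__getitem__)
        # Simulated human feedback: does a beat b?
        prefers_a = rng.random() < sigmoid(true_rewards[a] - true_rewards[b])
        # Bayes update: add the log-likelihood of the observed preference.
        for i, h in enumerate(hypotheses):
            p = sigmoid(h[a] - h[b])
            log_post[i] += math.log(p if prefers_a else 1.0 - p)
        # Average suboptimality of the two compared actions.
        regret += ((best - true_rewards[a]) + (best - true_rewards[b])) / 2.0
    return regret, log_post


# Hypothesis class: every ordering of three reward levels over three actions.
hypotheses = [list(p) for p in itertools.permutations([0.0, 1.0, 2.0])]
true_rewards = [0.0, 2.0, 1.0]  # one member of the class, unknown to the learner
regret, log_post = run_preference_ts(true_rewards, hypotheses, rounds=200)
print(f"cumulative regret after 200 rounds: {regret:.1f}")
```

Drawing two independent posterior samples is one simple way to choose a comparison pair; when both samples select the same action, the round yields no information, which is a known limitation of this toy scheme and another reason to read it only as an illustration of the sampling-then-acting loop.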
Taxonomy
Topics: Antenna Design and Analysis · Advanced Mathematical Physics Problems
Methods: ADaptive gradient method with the OPTimal convergence rate
