Thompson Sampling in Online RLHF with General Function Approximation

Songtao Feng; Jie Fu

arXiv:2505.23927·cs.LG·June 2, 2025

TL;DR

This paper introduces a Thompson sampling-based algorithm for online reinforcement learning from human feedback with function approximation, providing theoretical regret guarantees and new concentration inequalities.

Contribution

The paper develops a model-free posterior sampling method for online RLHF and bounds its regret in terms of the Bellman eluder dimension of the function class, contributing to the theory of the statistical efficiency of RLHF.

Findings

01

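The proposed algorithm achieves an $O(\sqrt{T})$ regret bound, with multiplicative factors depending on the horizon, the BE dimension, and the log-bracketing number of the function class.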

02

Establishes a new concentration inequality for the squared Bellman error, based on maximum likelihood estimation (MLE).

03

Provides theoretical analysis linking function class complexity to regret in RLHF.

Abstract

Reinforcement learning from human feedback (RLHF) has achieved great empirical success in aligning large language models (LLMs) with human preferences, and it is of great importance to study the statistical efficiency of RLHF algorithms from a theoretical perspective. In this work, we consider the online RLHF setting, where the preference data is revealed during the learning process, and study action-value function approximation. We design a model-free posterior sampling algorithm for online RLHF inspired by Thompson sampling and provide its theoretical guarantee. Specifically, we adopt the Bellman eluder (BE) dimension as the complexity measure of the function class and establish an $O(\sqrt{T})$ regret bound for the proposed algorithm, with additional multiplicative factors depending on the horizon, the BE dimension, and the log-bracketing number of the function class. Further, in the analysis, we first…
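To make the idea of posterior sampling with preference feedback concrete, here is a minimal Python sketch. It is not the paper's algorithm: the paper works with action-value functions under general function approximation, whereas this toy is a bandit with a small finite hypothesis class. The Bradley-Terry preference model, the function names `sample_hypothesis` and `run_preference_ts`, and the regret definition (gap of the average of the two compared actions) are all assumptions made for illustration.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sample_hypothesis(log_weights, rng):
    """Draw a hypothesis index with probability proportional to exp(log_weight)."""
    shift = max(log_weights)  # subtract the max for numerical stability
    weights = [math.exp(lw - shift) for lw in log_weights]
    return rng.choices(range(len(log_weights)), weights=weights, k=1)[0]


def run_preference_ts(true_rewards, hypotheses, rounds, seed=0):
    """Toy Thompson sampling with pairwise preference feedback (assumed setup).

    true_rewards: hidden reward per action, used only to simulate feedback.
    hypotheses:   finite list of candidate reward vectors (the "function class").
    Returns (cumulative_regret, log_weights), where log_weights is the
    unnormalized log-posterior over hypotheses under a uniform prior.
    """
    rng = random.Random(seed)
    n_actions = len(true_rewards)
    log_weights = [0.0] * len(hypotheses)  # uniform prior
    best = max(true_rewards)
    cumulative_regret = 0.0

    for _ in range(rounds):
        # Two independent posterior samples; each acts greedily on its sample.
        h1 = hypotheses[sample_hypothesis(log_weights, rng)]
        h2 = hypotheses[sample_hypothesis(log_weights, rng)]
        a = max(range(n_actions), key=lambda i: h1[i])
        b = max(range(n_actions), key=lambda i: h2[i])

        # Simulated preference: action a is preferred to b with
        # Bradley-Terry probability sigmoid(r(a) - r(b)) (an assumption).
        a_preferred = rng.random() < sigmoid(true_rewards[a] - true_rewards[b])

        # Posterior update: add each hypothesis's log-likelihood of the outcome.
        for i, h in enumerate(hypotheses):
            q = sigmoid(h[a] - h[b])
            log_weights[i] += math.log(q if a_preferred else 1.0 - q)

        cumulative_regret += best - 0.5 * (true_rewards[a] + true_rewards[b])

    return cumulative_regret, log_weights


if __name__ == "__main__":
    candidates = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    regret, log_post = run_preference_ts(candidates[0], candidates, rounds=500)
    print("cumulative regret:", regret)
    print("log-posterior weights:", log_post)
```

Drawing two independent samples per round is one simple way to obtain a pair of actions to compare; when both samples select the same action the comparison is uninformative, and the sketch relies on posterior randomness to keep producing distinct pairs.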

Taxonomy

Topics: Antenna Design and Analysis · Advanced Mathematical Physics Problems

Methods: ADaptive gradient method with the OPTimal convergence rate