Regularized Online RLHF with Generalized Bilinear Preferences
Junghyun Lee, Minju Hong, Kwang-Sung Jun, Chulhee Yun, Se-Young Yun

TL;DR
This paper studies contextual online reinforcement learning from human feedback (RLHF) under the Generalized Bilinear Preference Model (GBPM), proving regret bounds for two simple algorithms, including guarantees for the high-dimensional setting.
Contribution
It models general, possibly intransitive preferences with GBPM, proves that the dual gap of the greedy policy is bounded by the square of the estimation error, and offers the first statistically efficient guarantees for high-dimensional online RLHF.
Findings
Polylogarithmic regret for Greedy Sampling.
Polynomial regret for Explore-Then-Commit.
First high-dimensional regret guarantees in online RLHF.
Abstract
We consider the problem of contextual online RLHF with general preferences, where the goal is to identify the Nash Equilibrium. We adopt the Generalized Bilinear Preference Model (GBPM) to capture potentially intransitive preferences via low-rank, skew-symmetric matrices. We investigate general preference learning with any strongly convex regularizer and regularization strength , generalizing beyond prior work limited to reverse KL-regularization. Central to our analysis is proving that the dual gap of the greedy policy is bounded by the square of the estimation error, a result derived solely from strong convexity and the skew-symmetry of GBPM. Building on this insight and a feature diversity assumption, we establish two regret bounds via two simple algorithms: (1) Greedy Sampling achieves polylogarithmic, -free regret $\tilde{\mathcal{O}}(\eta d^4…
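As a rough illustration of how a skew-symmetric bilinear score can encode intransitive preferences, the sketch below builds a toy rock-paper-scissors instance. Everything in it is an assumption made for illustration and is not taken from the paper: the sigmoid link, the one-hot features, the matrix `THETA`, and the helper names `bilinear_score`, `preference`, and `win_rate_vs_policy`. The example is also unregularized, so it only hints at the Nash Equilibrium objective described above.

```python
import math

# Hypothetical toy instance: 3 actions with one-hot features and a
# skew-symmetric parameter matrix (THETA[i][j] == -THETA[j][i]).
# Its rows sum to the zero vector, so its rank is 2 (low-rank in d = 3).
THETA = [
    [0.0, -1.0, 1.0],   # rock:     loses to paper, beats scissors
    [1.0, 0.0, -1.0],   # paper:    beats rock, loses to scissors
    [-1.0, 1.0, 0.0],   # scissors: loses to rock, beats paper
]
FEATURES = {
    "rock": [1.0, 0.0, 0.0],
    "paper": [0.0, 1.0, 0.0],
    "scissors": [0.0, 0.0, 1.0],
}


def sigmoid(z: float) -> float:
    """Logistic link (an assumed choice of link function)."""
    return 1.0 / (1.0 + math.exp(-z))


def bilinear_score(theta, phi_a, phi_b) -> float:
    """Compute phi_a^T theta phi_b."""
    return sum(
        phi_a[i] * theta[i][j] * phi_b[j]
        for i in range(len(phi_a))
        for j in range(len(phi_b))
    )


def preference(a: str, b: str) -> float:
    """Modeled probability that action a is preferred over action b."""
    return sigmoid(bilinear_score(THETA, FEATURES[a], FEATURES[b]))


def win_rate_vs_policy(a: str, policy: dict) -> float:
    """Expected preference of action a against an opponent drawn from policy."""
    return sum(prob * preference(a, b) for b, prob in policy.items())


ACTIONS = list(FEATURES)
UNIFORM = {a: 1.0 / len(ACTIONS) for a in ACTIONS}

if __name__ == "__main__":
    for a in ACTIONS:
        for b in ACTIONS:
            print(f"P({a} > {b}) = {preference(a, b):.3f}")
    for a in ACTIONS:
        print(f"{a} vs uniform: {win_rate_vs_policy(a, UNIFORM):.3f}")
```

Skew-symmetry forces the score of (b, a) to be the negative of the score of (a, b), so the two modeled probabilities sum to one. The cycle rock > scissors > paper > rock cannot be represented by any scalar reward, which is the kind of intransitivity a bilinear model can capture. In this toy game no single action wins more than half the time against the uniform policy, consistent with uniform play being an equilibrium of the unregularized game.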
Taxonomy
Topics: Advanced Bandit Algorithms Research · Stochastic Gradient Optimization Techniques · Reinforcement Learning in Robotics
