Regularized Online RLHF with Generalized Bilinear Preferences

Junghyun Lee; Minju Hong; Kwang-Sung Jun; Chulhee Yun; Se-Young Yun

arXiv:2602.23116·cs.LG·March 6, 2026

TL;DR

This paper studies contextual online RLHF with general preferences under the Generalized Bilinear Preference Model (GBPM) and any strongly convex regularizer, and establishes regret bounds for two simple algorithms, including guarantees for the high-dimensional setting.

Contribution

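It adopts GBPM to capture potentially intransitive preferences via low-rank, skew-symmetric matrices, proves that the dual gap of the greedy policy is bounded by the square of the estimation error, and offers the first statistically efficient guarantees for high-dimensional online RLHF.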

Findings

01. Polylogarithmic regret for Greedy Sampling.

02. Polynomial regret for Explore-Then-Commit.

03. First high-dimensional regret guarantees in online RLHF.

Abstract

We consider the problem of contextual online RLHF with general preferences, where the goal is to identify the Nash Equilibrium. We adopt the Generalized Bilinear Preference Model (GBPM) to capture potentially intransitive preferences via low-rank, skew-symmetric matrices. We investigate general preference learning with any strongly convex regularizer and regularization strength $\eta^{-1}$, generalizing beyond prior work limited to reverse KL-regularization. Central to our analysis is proving that the dual gap of the greedy policy is bounded by the square of the estimation error, a result derived solely from strong convexity and the skew-symmetry of GBPM. Building on this insight and a feature diversity assumption, we establish two regret bounds via two simple algorithms: (1) Greedy Sampling achieves polylogarithmic, $e^{O(\eta)}$-free regret $\tilde{\mathcal{O}}(\eta d^4…
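To make the role of skew-symmetry concrete, the sketch below builds a toy bilinear preference model and checks that it produces a rock-paper-scissors cycle. It is an illustration only, not the paper's exact GBPM definition: the logistic link, the factorization $\Theta = UV^\top - VU^\top$, the 2-D features, and the names `skew_from_factors` and `preference_prob` are all assumptions made for this example.

```python
import numpy as np


def skew_from_factors(U, V):
    """Build Theta = U V^T - V U^T, which is skew-symmetric with rank <= 2r.

    Hypothetical helper: this factorization is one convenient way to get a
    low-rank skew-symmetric matrix, not necessarily the paper's.
    """
    return U @ V.T - V @ U.T


def preference_prob(phi_a, phi_b, theta):
    """P(a preferred over b) = sigmoid(phi_a^T Theta phi_b).

    The logistic link is an assumption made for this sketch.
    """
    score = phi_a @ theta @ phi_b
    return 1.0 / (1.0 + np.exp(-score))


# Three actions with unit-norm 2-D features spaced 120 degrees apart.
angles = 2.0 * np.pi * np.arange(3) / 3.0
features = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # shape (3, 2)

# Rank-2 skew-symmetric parameter: Theta = [[0, 1], [-1, 0]].
U = np.array([[1.0], [0.0]])
V = np.array([[0.0], [1.0]])
theta = skew_from_factors(U, V)

# Pairwise preference matrix: probs[i, j] = P(action i preferred over action j).
probs = np.array(
    [[preference_prob(features[i], features[j], theta) for j in range(3)]
     for i in range(3)]
)

# Skew-symmetry of Theta gives P(a > b) + P(b > a) = 1 for every pair.
consistent = np.allclose(probs + probs.T, 1.0)

# Action 0 beats 1, 1 beats 2, and 2 beats 0: a cycle that no scalar
# reward model (e.g. Bradley-Terry) can represent.
is_cycle = probs[0, 1] > 0.5 and probs[1, 2] > 0.5 and probs[2, 0] > 0.5
```

Under these assumptions the score $\phi_a^\top \Theta \phi_b$ equals the 2-D cross product of the two feature vectors, so its sign depends only on the rotational order of the actions, which is what produces the cycle.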

Taxonomy

Topics: Advanced Bandit Algorithms Research · Stochastic Gradient Optimization Techniques · Reinforcement Learning in Robotics