Online Bandit Learning with Offline Preference Data for Improved RLHF
Akhil Agnihotri, Rahul Jain, Deepak Ramachandran, Zheng Wen

TL;DR
This paper introduces warmPref-PS, a posterior sampling algorithm for online bandit learning that can be warm-started with noisy offline preference data to improve reinforcement learning with human feedback (RLHF). The method is supported by theoretical and empirical results.
Contribution
It proposes a new posterior sampling algorithm that effectively incorporates offline preference data into online RLHF, accounting for expert competence.
Findings
Theoretical analysis shows improved Bayesian regret bounds.
Empirical results demonstrate superior performance over baselines.
The method effectively utilizes offline data to enhance online learning.
Abstract
Reinforcement Learning with Human Feedback (RLHF) is at the core of fine-tuning methods for generative AI models for language and images. Such feedback is often sought as rank or preference feedback from human raters, as opposed to eliciting scores, since the latter tend to be noisy. On the other hand, RL theory and algorithms predominantly assume that reward feedback is available. In particular, approaches for online learning that can be helpful in adaptive data collection via active learning cannot incorporate offline preference data. In this paper, we adopt a finite-armed linear bandit model as a prototypical model of online learning. We assume that an offline preference dataset is available, generated by an expert of unknown 'competence'. We propose warmPref-PS, a posterior sampling algorithm for online learning that can be warm-started with an offline dataset with noisy preference…
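The setting described in the abstract (a finite-armed linear bandit, posterior sampling, and a warm start from offline pairwise preferences) can be illustrated with a minimal sketch. This is not the paper's warmPref-PS algorithm. It assumes a Bradley-Terry preference likelihood, a Gaussian prior, a Laplace approximation to the preference posterior, and a known expert competence `beta`, whereas the abstract treats competence as unknown. The function names `warm_start_from_preferences` and `posterior_sampling` and all parameters are hypothetical.

```python
import numpy as np


def warm_start_from_preferences(arms, prefs, beta=1.0, prior_var=1.0,
                                steps=200, lr=0.1):
    """Gaussian (Laplace) approximation to the posterior over theta.

    arms:  (K, d) array of arm feature vectors.
    prefs: list of (winner_index, loser_index) pairs from the offline expert.
    beta:  ASSUMED known competence (sharpness of the Bradley-Terry model).
    Returns the approximate posterior mean and precision matrix.
    """
    d = arms.shape[1]
    idx = np.asarray(prefs, dtype=int)
    diffs = arms[idx[:, 0]] - arms[idx[:, 1]]  # winner minus loser features
    theta = np.zeros(d)
    for _ in range(steps):
        # Probability the model assigns to each observed preference.
        p = 1.0 / (1.0 + np.exp(-beta * diffs @ theta))
        # Gradient of the log posterior (log-likelihood plus Gaussian prior).
        grad = beta * diffs.T @ (1.0 - p) - theta / prior_var
        theta += lr * grad / len(prefs)
    p = 1.0 / (1.0 + np.exp(-beta * diffs @ theta))
    # Negative Hessian of the log posterior at the approximate mode.
    precision = np.eye(d) / prior_var + beta**2 * (diffs.T * (p * (1.0 - p))) @ diffs
    return theta, precision


def posterior_sampling(arms, theta_true, mean, precision, horizon=500,
                       noise_sd=0.5, rng=None):
    """Linear-Gaussian posterior (Thompson) sampling; returns cumulative regret."""
    rng = np.random.default_rng() if rng is None else rng
    rewards = arms @ theta_true          # true mean reward of every arm
    best = rewards.max()
    b = precision @ mean                 # natural parameter of the Gaussian
    regret = 0.0
    for _ in range(horizon):
        cov = np.linalg.inv(precision)
        cov = (cov + cov.T) / 2.0        # guard against round-off asymmetry
        theta_sample = rng.multivariate_normal(cov @ b, cov)
        a = int(np.argmax(arms @ theta_sample))   # act greedily on the sample
        r = rewards[a] + noise_sd * rng.normal()  # noisy reward feedback
        # Conjugate Gaussian update with the observed reward.
        precision = precision + np.outer(arms[a], arms[a]) / noise_sd**2
        b = b + arms[a] * r / noise_sd**2
        regret += best - rewards[a]
    return regret
```

Calling `warm_start_from_preferences` first and passing its mean and precision to `posterior_sampling` replaces the uninformed prior with one shaped by the offline data; passing a zero mean and an identity precision instead gives the cold-start baseline for comparison.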
Taxonomy
Topics: Advanced Bandit Algorithms Research · Data Stream Mining Techniques · Machine Learning and Data Classification
