Provably Efficient Online RLHF with One-Pass Reward Modeling
Long-Fei Li, Yu-Yang Qian, Peng Zhao, Zhi-Hua Zhou

TL;DR
This paper introduces a one-pass reward modeling algorithm for online RLHF that avoids storing historical data, enabling constant-time updates and improving efficiency in aligning language models with human preferences.
Contribution
The paper formalizes online RLHF as a contextual preference bandit problem and develops an online mirror descent algorithm with theoretical guarantees, which lowers the computational cost of reward modeling.
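To make the one-pass idea concrete, here is a minimal sketch of a reward model that is updated from a single preference pair at a time and never revisits earlier data. It is not the paper's algorithm. It assumes a linear Bradley-Terry reward over fixed feature vectors and uses a generic mirror-descent-style step with a quadratic regularizer built from a running curvature matrix. The class name `OnePassRewardModel`, its parameters `step_size` and `reg`, and the feature vectors are all hypothetical and chosen only for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


class OnePassRewardModel:
    """Illustrative linear Bradley-Terry reward model (hypothetical, not the paper's method).

    State is a parameter vector and a d x d curvature matrix, so memory and
    per-update cost depend on the feature dimension d, not on the number of
    comparisons seen so far.
    """

    def __init__(self, dim, step_size=1.0, reg=1.0):
        self.theta = np.zeros(dim)
        self.H = reg * np.eye(dim)  # running curvature estimate (assumed regularizer)
        self.step_size = step_size

    def update(self, phi_chosen, phi_rejected):
        """Consume one preference pair, then discard it."""
        z = phi_chosen - phi_rejected
        p = sigmoid(self.theta @ z)  # current probability that "chosen" wins
        grad = (p - 1.0) * z  # gradient of -log sigmoid(theta^T z)
        self.H += p * (1.0 - p) * np.outer(z, z)  # accumulate local curvature
        # Mirror-descent step under the quadratic mirror map 0.5 * theta^T H theta.
        self.theta -= self.step_size * np.linalg.solve(self.H, grad)

    def reward(self, phi):
        return float(self.theta @ phi)


# Usage: stream synthetic comparisons through the model one at a time.
rng = np.random.default_rng(0)
true_theta = np.array([1.0, -1.0, 0.5])  # made-up ground truth for the demo
model = OnePassRewardModel(dim=3)
for _ in range(500):
    a, b = rng.normal(size=3), rng.normal(size=3)
    a_wins = rng.random() < sigmoid(true_theta @ (a - b))
    chosen, rejected = (a, b) if a_wins else (b, a)
    model.update(chosen, rejected)
```

The point of the sketch is the shape of the computation: each call to `update` touches only the current pair and a fixed-size state, in contrast to refitting the reward model on an ever-growing dataset at every iteration.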
Findings
Achieves constant-time updates per iteration.
Demonstrates improved statistical and computational efficiency.
Validates effectiveness on large language models with real datasets.
Abstract
Reinforcement Learning from Human Feedback (RLHF) has shown remarkable success in aligning Large Language Models (LLMs) with human preferences. Traditional RLHF methods rely on a fixed dataset, which often suffers from limited coverage. To this end, online RLHF has emerged as a promising direction, enabling iterative data collection and refinement. Despite its potential, this paradigm faces a key bottleneck: the requirement to continuously integrate new data into the dataset and re-optimize the model from scratch at each iteration, resulting in computational and storage costs that grow linearly with the number of iterations. In this work, we address this challenge by proposing a one-pass reward modeling method that eliminates the need to store historical data and achieves constant-time updates per iteration. Specifically, we first formalize RLHF as a contextual preference bandit and…
Taxonomy
Topics: Water Systems and Optimization · Smart Grid Energy Management
