TL;DR
This paper gives theoretical guarantees for greedy sampling in Reinforcement Learning from Human Feedback (RLHF) under general preference models, with order-wise improvements over the guarantees known for optimistic or pessimistic methods.
Contribution
It establishes performance guarantees for greedy sampling in RLHF under general preference models, showing that directly using the empirical estimates is sufficient and that optimistic or pessimistic estimates are not needed to obtain them.
Findings
Greedy sampling achieves provable efficiency in RLHF.
The guarantees improve order-wise on those of existing algorithms that rely on optimism or pessimism.
Specialization to the Bradley-Terry model highlights the effectiveness of greedy sampling.
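To make the Bradley-Terry specialization concrete, here is a minimal sketch of what "greedy" means in that setting. It is not the paper's algorithm. It assumes a finite set of candidate responses and uses the standard closed form for a KL-regularized objective, where the policy is proportional to the reference policy times exp(reward / beta). The names `bt_preference`, `greedy_policy`, `ref_probs`, `reward_estimates`, and `beta` are hypothetical and chosen only for illustration.

```python
import math


def bt_preference(r_a, r_b):
    """Bradley-Terry probability that response a is preferred over response b."""
    return 1.0 / (1.0 + math.exp(-(r_a - r_b)))


def greedy_policy(ref_probs, reward_estimates, beta):
    """Plug-in (greedy) KL-regularized policy.

    Returns pi(a) proportional to ref_probs[a] * exp(reward_estimates[a] / beta).
    The empirical reward estimates are used directly: no optimism bonus is
    added and no pessimism penalty is subtracted.
    """
    # Subtract the max estimate before exponentiating for numerical stability.
    max_r = max(reward_estimates)
    weights = [
        p * math.exp((r - max_r) / beta)
        for p, r in zip(ref_probs, reward_estimates)
    ]
    total = sum(weights)
    return [w / total for w in weights]


# Illustrative numbers only (assumed, not taken from the paper).
ref_probs = [0.5, 0.3, 0.2]
reward_estimates = [0.0, 1.0, 0.5]
policy = greedy_policy(ref_probs, reward_estimates, beta=1.0)
```

An optimistic or pessimistic variant would adjust `reward_estimates` by an uncertainty term before this step. The greedy version skips that adjustment, which is the design choice the findings above refer to.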
Abstract
Reinforcement Learning from Human Feedback (RLHF) has emerged as a key technique for post-training large language models. Despite its empirical success, the theoretical understanding of RLHF is still limited, as learning the KL-regularized target with only preference feedback poses additional challenges compared with canonical RL. Existing works mostly study the reward-based Bradley-Terry (BT) preference model, and extend classical designs utilizing optimism or pessimism. This work, instead, considers the general preference model (whose practical relevance has been observed recently) and obtains performance guarantees with major, order-wise improvements over existing ones. Surprisingly, these results are derived from algorithms that directly use the empirical estimates (i.e., greedy sampling), as opposed to constructing optimistic or pessimistic estimates in previous works. This insight…