Greedy Sampling Is Provably Efficient for RLHF

Di Wu; Chengshuai Shi; Jing Yang; Cong Shen

arXiv:2510.24700 · cs.LG · October 29, 2025

TL;DR

This paper gives theoretical guarantees for greedy sampling in Reinforcement Learning from Human Feedback (RLHF), particularly under general preference models, and shows that these guarantees improve order-wise on those of existing optimistic or pessimistic methods.

Contribution

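It establishes performance guarantees for greedy sampling in RLHF under general preference models, showing that directly using empirical estimates suffices, without constructing optimistic or pessimistic estimates, and improves on the guarantees of existing methods.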

Findings

01

Greedy sampling achieves provable efficiency in RLHF.

02

The guarantees improve order-wise on those of existing algorithms that rely on optimism or pessimism.

03

Specialization to the Bradley-Terry model highlights the effectiveness of greedy sampling.

Abstract

Reinforcement Learning from Human Feedback (RLHF) has emerged as a key technique for post-training large language models. Despite its empirical success, the theoretical understanding of RLHF is still limited, as learning the KL-regularized target with only preference feedback poses additional challenges compared with canonical RL. Existing works mostly study the reward-based Bradley-Terry (BT) preference model, and extend classical designs utilizing optimism or pessimism. This work, instead, considers the general preference model (whose practical relevance has been observed recently) and obtains performance guarantees with major, order-wise improvements over existing ones. Surprisingly, these results are derived from algorithms that directly use the empirical estimates (i.e., greedy sampling), as opposed to constructing optimistic or pessimistic estimates in previous works. This insight…
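To make the abstract's terms concrete, the following is a minimal sketch of the Bradley-Terry special case only, not the authors' algorithm and not the general preference setting the paper focuses on. It assumes a finite set of candidate responses, a regularized objective of the form E_pi[r] - beta * KL(pi || pi_ref), and hypothetical names (`bt_preference`, `kl_regularized_policy`, `r_hat`, `pi_ref`, `beta`) that do not come from the paper. "Greedy" here means the empirical reward estimates are plugged in directly, with no optimism bonus or pessimism penalty.

```python
import math


def bt_preference(r_a: float, r_b: float) -> float:
    """Bradley-Terry probability that response a is preferred to response b."""
    return 1.0 / (1.0 + math.exp(-(r_a - r_b)))


def kl_regularized_policy(rewards, ref_probs, beta):
    """Maximizer of E_pi[r] - beta * KL(pi || pi_ref) over a finite response set.

    The closed form is pi(y) proportional to pi_ref(y) * exp(r(y) / beta).
    """
    m = max(rewards)  # subtract the max reward for numerical stability
    weights = [p * math.exp((r - m) / beta) for r, p in zip(rewards, ref_probs)]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical empirical reward estimates for three candidate responses.
r_hat = [0.2, 1.0, -0.5]
# Hypothetical uniform reference policy over the same three responses.
pi_ref = [1 / 3, 1 / 3, 1 / 3]

# Greedy step: use the estimates as-is (no bonus, no penalty).
pi_greedy = kl_regularized_policy(r_hat, pi_ref, beta=0.5)

# Preference probability implied by the estimates for response 1 vs. response 0.
p_1_over_0 = bt_preference(r_hat[1], r_hat[0])
```

An optimistic or pessimistic variant would add or subtract an uncertainty term from `r_hat` before this step; the sketch omits that term, which is the distinction the abstract draws.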

Videos

Greedy Sampling Is Provably Efficient For RLHF · SlidesLive