Reinforcement Learning from Human Feedback without Reward Inference: Model-Free Algorithm and Instance-Dependent Analysis
Qining Zhang, Honghao Wei, Lei Ying

TL;DR
This paper introduces a model-free reinforcement learning algorithm, BSAD, that learns from human feedback without explicit reward inference, offering instance-dependent guarantees and potential performance improvements over reward inference methods.
Contribution
The paper presents BSAD, a novel model-free RLHF algorithm that directly identifies optimal policies from human preferences without reward inference, with provable sample complexity and adaptability to various settings.
Findings
BSAD achieves instance-dependent sample complexity similar to classic RL.
RLHF can be performed effectively without reward inference.
End-to-end RLHF may outperform reward inference methods.
Abstract
In this paper, we study reinforcement learning from human feedback (RLHF) under an episodic Markov decision process with a general trajectory-wise reward model. We developed a model-free RLHF best policy identification algorithm, called BSAD, without explicit reward model inference, which is a critical intermediate step in the contemporary RLHF paradigms for training large language models (LLMs). The algorithm identifies the optimal policy directly from human preference information in a backward manner, employing a dueling bandit sub-routine that constantly duels actions to identify the superior one. BSAD adopts a reward-free exploration and best-arm-identification-like adaptive stopping criteria to equalize the visitation among all states in the same decision step while moving to the previous step as soon as the optimal action is identifiable, leading to a…
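The dueling sub-routine with adaptive stopping mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's BSAD algorithm: it is a generic two-action duel that assumes Bernoulli preference feedback and a Hoeffding-style confidence radius, and the names `duel`, `prefer_prob`, `delta`, and `max_rounds` are hypothetical, chosen only for illustration.

```python
import math
import random


def duel(prefer_prob, delta=0.05, max_rounds=100_000, rng=None):
    """Duel two actions until one is identifiably preferred.

    prefer_prob is a hypothetical probability that the human prefers
    action 0 over action 1 (an assumption, not from the paper).
    Returns (winner, rounds); winner is None if the budget runs out.
    """
    rng = rng or random.Random(0)
    wins = 0  # number of duels won by action 0
    for t in range(1, max_rounds + 1):
        # Simulated human feedback: True when action 0 is preferred.
        wins += rng.random() < prefer_prob
        mean = wins / t
        # Anytime confidence radius (Hoeffding bound with a union
        # bound over rounds); an assumed choice for this sketch.
        radius = math.sqrt(math.log(4 * t * t / delta) / (2 * t))
        # Adaptive stopping: halt once 0.5 leaves the interval.
        if mean - radius > 0.5:
            return 0, t
        if mean + radius < 0.5:
            return 1, t
    return None, max_rounds


if __name__ == "__main__":
    for p in (0.9, 0.7, 0.6):
        winner, rounds = duel(p)
        print(f"prefer_prob={p}: winner={winner}, rounds={rounds}")
```

The stopping time in this sketch adapts to how far `prefer_prob` is from 0.5, which is the same flavor of instance dependence the abstract describes, although the paper's actual criteria and guarantees may differ.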
Taxonomy
Topics: Reinforcement Learning in Robotics
