Reinforcement Learning from Human Feedback without Reward Inference: Model-Free Algorithm and Instance-Dependent Analysis

Qining Zhang; Honghao Wei; Lei Ying

arXiv:2406.07455 · cs.LG · January 22, 2025

Open Access

TL;DR

This paper introduces a model-free reinforcement learning algorithm, BSAD, that learns from human feedback without explicit reward inference, offering instance-dependent guarantees and potential performance improvements over reward inference methods.

Contribution

The paper presents BSAD, a novel model-free RLHF algorithm that directly identifies optimal policies from human preferences without reward inference, with provable sample complexity and adaptability to various settings.

Findings

01 BSAD achieves instance-dependent sample complexity similar to classic RL.

02 RLHF can be performed effectively without reward inference.

03 End-to-end RLHF may outperform reward inference methods.

Abstract

In this paper, we study reinforcement learning from human feedback (RLHF) under an episodic Markov decision process with a general trajectory-wise reward model. We develop a model-free RLHF best policy identification algorithm, called $BSAD$, that avoids explicit reward model inference, a critical intermediate step in the contemporary RLHF paradigm for training large language models (LLMs). The algorithm identifies the optimal policy directly from human preference information in a backward manner, employing a dueling bandit sub-routine that repeatedly duels actions to identify the superior one. $BSAD$ adopts reward-free exploration and best-arm-identification-like adaptive stopping criteria to equalize the visitation among all states in the same decision step while moving to the previous step as soon as the optimal action is identifiable, leading to a…
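The dueling sub-routine with adaptive stopping that the abstract describes can be illustrated with a minimal sketch. This is not the paper's BSAD algorithm: it covers only two actions at a single state, and the function name `duel_until_identified`, the oracle `prefer`, the confidence level `delta`, and the Hoeffding-style confidence radius are all assumptions made for illustration.

```python
import math


def duel_until_identified(prefer, delta=0.05, max_duels=100_000):
    """Duel two actions until one is identifiably preferred.

    `prefer` is a hypothetical preference oracle (an assumption, not from
    the paper): each call returns True when action 0 wins the duel.
    Returns (winning_action, number_of_duels_used).
    """
    wins = 0
    for n in range(1, max_duels + 1):
        if prefer():
            wins += 1
        mean = wins / n
        # Hoeffding-style radius with a union bound over rounds, chosen so
        # the intervals hold for every n with probability at least 1 - delta.
        radius = math.sqrt(math.log(4 * n * n / delta) / (2 * n))
        # Adaptive stopping: stop once the interval excludes a tie (0.5).
        if mean - radius > 0.5:
            return 0, n
        if mean + radius < 0.5:
            return 1, n
    # Budget exhausted: fall back to the empirical majority.
    return (0 if 2 * wins >= max_duels else 1), max_duels
```

With an oracle that prefers action 0 more often than not, the loop stops earlier when the preference gap is larger, which is the kind of instance-dependent behavior the abstract refers to. A backward procedure would apply such a routine at the last decision step first and then move to earlier steps.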

Taxonomy

Topics: Reinforcement Learning in Robotics