Is RLHF More Difficult than Standard RL?
Yuanhao Wang, Qinghua Liu, Chi Jin

TL;DR
This paper proves that, for a wide range of preference models, preference-based reinforcement learning can be solved with existing reward-based RL algorithms at small or no extra cost.
Contribution
It provides a theoretical framework showing preference-based RL can be solved using existing reward-based RL algorithms across various models and preference types.
Findings
Preference-based RL can be reduced to reward-based RL with small or no extra costs.
Theoretical guarantees are provided for tabular and function approximation MDPs.
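Preferences drawn from reward-based probabilistic models reduce to robust reward-based RL that tolerates small reward errors; general preferences, where the goal is the von Neumann winner, reduce to multiagent reward-based RL.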
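To make the von Neumann winner concrete, here is a minimal Python sketch on a toy problem. It is not the paper's algorithm, which works with Markov games. It assumes a small, fixed set of options and a known preference matrix, and treats the problem as a symmetric zero-sum matrix game solved by multiplicative-weights self-play. A von Neumann winner is a randomized choice that is preferred to any alternative with probability at least 1/2. The function name `von_neumann_winner`, the parameters `steps` and `lr`, and the example matrix are all hypothetical.

```python
import numpy as np


def von_neumann_winner(pref, steps=20000, lr=0.02):
    """Approximate a von Neumann winner by multiplicative-weights self-play.

    pref[i, j] is the probability that option i is preferred to option j
    (so pref[i, j] + pref[j, i] == 1). Returns a distribution over options.
    """
    pref = np.asarray(pref, dtype=float)
    payoff = pref - 0.5  # antisymmetric payoff matrix of a zero-sum game
    n = payoff.shape[0]
    strategy = np.full(n, 1.0 / n)
    average = np.zeros(n)
    for _ in range(steps):
        average += strategy
        # Expected payoff of each pure option against the current mixture.
        gains = payoff @ strategy
        strategy = strategy * np.exp(lr * gains)
        strategy /= strategy.sum()
    # The time-averaged strategy approximates an equilibrium of the game.
    return average / steps


# Hypothetical cyclic preferences: no single option beats every other one.
pref = np.array([
    [0.5, 0.9, 0.2],
    [0.1, 0.5, 0.7],
    [0.8, 0.3, 0.5],
])
winner = von_neumann_winner(pref)
# Probability that the mixture is preferred to each pure option.
win_prob = winner @ pref
print(winner, win_prob)
```

If the self-play has converged, every entry of `win_prob` should be close to or above 0.5, meaning no single option is preferred to the mixture much more than half the time.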
Abstract
Reinforcement Learning from Human Feedback (RLHF) learns from preference signals, while standard Reinforcement Learning (RL) directly learns from reward signals. Preferences arguably contain less information than rewards, which makes preference-based RL seemingly more difficult. This paper theoretically proves that, for a wide range of preference models, we can solve preference-based RL directly using existing algorithms and techniques for reward-based RL, with small or no extra costs. Specifically, (1) for preferences that are drawn from reward-based probabilistic models, we reduce the problem to robust reward-based RL that can tolerate small errors in rewards; (2) for general arbitrary preferences where the objective is to find the von Neumann winner, we reduce the problem to multiagent reward-based RL which finds Nash equilibria for factored Markov games with a restricted set of…
Taxonomy
Topics: Reinforcement Learning in Robotics
