Best Policy Learning from Trajectory Preference Feedback
Akhil Agnihotri, Rahul Jain, Deepak Ramachandran, Zheng Wen

TL;DR
This paper introduces PSPL (Posterior Sampling for Preference Learning), a Bayesian algorithm for best policy identification in preference-based reinforcement learning (PbRL). It combines a potentially biased offline preference dataset with online pure exploration, comes with theoretical guarantees, and outperforms baselines in the reported experiments.
Contribution
It proposes PSPL, a novel Bayesian algorithm for PbRL with regret guarantees that combines offline preference data with online exploration and outperforms existing methods.
Findings
PSPL achieves lower simple regret in simulations.
The algorithm outperforms baselines on image generation benchmarks.
PSPL provides the first Bayesian regret guarantees for PbRL.
Abstract
Reinforcement Learning from Human Feedback (RLHF) has emerged as a powerful approach for aligning generative models, but its reliance on learned reward models makes it vulnerable to mis-specification and reward hacking. Preference-based Reinforcement Learning (PbRL) offers a more robust alternative by directly leveraging noisy binary comparisons over trajectories. We study the best policy identification problem in PbRL, motivated by post-training optimization of generative models, for example, during multi-turn interactions. Learning in this setting combines an offline preference dataset - potentially biased or out-of-distribution and collected from a rater of subpar 'competence' - with online pure exploration, making systematic online learning essential. To this end, we propose Posterior Sampling for Preference Learning (PSPL), a novel algorithm inspired by Top-Two Thompson…
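To make "noisy binary comparisons over trajectories" concrete, here is a minimal sketch of how such feedback could be simulated. It assumes a logistic (Bradley-Terry-style) preference model on trajectory returns, which is a common choice in PbRL but is not confirmed by the abstract as the paper's exact model. The function names and the `competence` scale parameter are hypothetical and exist only for illustration.

```python
import math
import random


def trajectory_return(rewards):
    """Total reward collected along one trajectory (a list of per-step rewards)."""
    return sum(rewards)


def preference_probability(return_a, return_b, competence=1.0):
    """Probability that the rater prefers trajectory A over trajectory B.

    Assumption: a logistic model on the return difference. `competence` is a
    hypothetical scale: 0 yields coin-flip labels, larger values yield labels
    that follow the true return ordering more reliably.
    """
    return 1.0 / (1.0 + math.exp(-competence * (return_a - return_b)))


def sample_preference(rewards_a, rewards_b, competence=1.0, rng=random):
    """Draw one noisy binary label: 1 if A is preferred, 0 if B is preferred."""
    p = preference_probability(
        trajectory_return(rewards_a), trajectory_return(rewards_b), competence
    )
    return 1 if rng.random() < p else 0


if __name__ == "__main__":
    better = [1.0, 1.0, 1.0]  # return 3.0
    worse = [0.0, 1.0, 0.0]   # return 1.0
    rng = random.Random(0)
    labels = [sample_preference(better, worse, competence=0.5, rng=rng) for _ in range(1000)]
    print("fraction preferring the better trajectory:", sum(labels) / len(labels))
```

Under this assumed model, a low `competence` value produces labels close to random, which is one way to picture why an offline dataset from a subpar rater cannot be trusted on its own and must be complemented by online exploration.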
