Best Policy Learning from Trajectory Preference Feedback

Akhil Agnihotri; Rahul Jain; Deepak Ramachandran; Zheng Wen

arXiv:2501.18873·cs.LG·April 23, 2026

TL;DR

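This paper introduces PSPL (Posterior Sampling for Preference Learning), a Bayesian algorithm for best policy identification in preference-based reinforcement learning (PbRL). It combines a possibly biased offline preference dataset with online pure exploration, comes with Bayesian regret guarantees, and is reported to outperform baselines in simulations and on image generation benchmarks.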

Contribution

The paper proposes PSPL, a Bayesian algorithm for PbRL with regret guarantees. It combines offline preference data with online exploration and is reported to outperform existing methods.

Findings

01. PSPL achieves lower simple regret than the compared baselines in simulations.

02. PSPL outperforms baselines on image generation benchmarks.

03. The paper provides the first Bayesian regret guarantees for PbRL.

Abstract

Reinforcement Learning from Human Feedback (RLHF) has emerged as a powerful approach for aligning generative models, but its reliance on learned reward models makes it vulnerable to mis-specification and reward hacking. Preference-based Reinforcement Learning (PbRL) offers a more robust alternative by directly leveraging noisy binary comparisons over trajectories. We study the best policy identification problem in PbRL, motivated by post-training optimization of generative models, for example, during multi-turn interactions. Learning in this setting combines an offline preference dataset (potentially biased or out-of-distribution, and collected from a rater of subpar 'competence') with online pure exploration, making systematic online learning essential. To this end, we propose Posterior Sampling for Preference Learning (PSPL), a novel algorithm inspired by Top-Two Thompson…
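
To make "noisy binary comparisons over trajectories" concrete, here is a minimal sketch of one common way such feedback is modeled: a logistic (Bradley-Terry style) choice rule in which a rater prefers the trajectory with the higher return more reliably as a 'competence' parameter grows. The abstract does not state which preference model the paper uses, so the logistic form, the function names, and the `competence` parameter are all assumptions made for illustration only.

```python
import math
import random


def preference_probability(return_a, return_b, competence):
    """P(rater prefers trajectory A over B) under an ASSUMED logistic model.

    competence = 0 gives a coin flip; larger values give a more reliable rater.
    """
    return 1.0 / (1.0 + math.exp(-competence * (return_a - return_b)))


def sample_preference(return_a, return_b, competence, rng):
    """Draw one noisy binary label: 1 if A is preferred, 0 otherwise."""
    p_a = preference_probability(return_a, return_b, competence)
    return 1 if rng.random() < p_a else 0


# Hypothetical usage: trajectory A has return 2.0, trajectory B has return 1.0.
rng = random.Random(0)
labels = [sample_preference(2.0, 1.0, competence=0.5, rng=rng) for _ in range(1000)]
print(sum(labels) / len(labels))  # empirical fraction of comparisons won by A
```

Under this assumed model, a rater with low competence returns labels that are close to random, which is one way an offline preference dataset can be uninformative or misleading about which trajectories are actually better.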
