Dueling Posterior Sampling for Preference-Based Reinforcement Learning
Ellen R. Novoseller, Yibing Wei, Yanan Sui, Yisong Yue, and Joel W. Burdick

TL;DR
This paper introduces DUELING POSTERIOR SAMPLING (DPS), a novel Bayesian approach for preference-based reinforcement learning that learns system dynamics and utility functions from trajectory preferences, providing the first regret guarantees in this setting.
Contribution
The paper presents DPS, a new Bayesian posterior sampling method for preference-based RL, with theoretical regret guarantees and empirical evaluation demonstrating competitive performance.
Findings
Proves an asymptotic Bayesian no-regret rate for DPS.
Develops a Bayesian credit assignment approach for trajectory preferences.
Shows DPS performs competitively against existing methods.
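The credit assignment finding can be illustrated with a minimal sketch, assuming a Bayesian linear regression model (this page lists linear regression under Methods; the exact model, prior, and label encoding here are illustrative assumptions, not taken from the paper). Each preference is encoded as a row holding the difference in state-action visit counts between the two trajectories, and the Gaussian posterior over per-state-action rewards follows in closed form. The names reward_posterior and sample_reward are hypothetical.

```python
import numpy as np

def reward_posterior(X, y, prior_precision=1.0, noise_var=1.0):
    """Gaussian posterior over per-state-action rewards (illustrative).

    X: (n, d) array; row i is the visit-count difference x_i2 - x_i1
       between the second and first trajectory of preference pair i.
    y: (n,) array; +0.5 if the second trajectory was preferred, -0.5 otherwise
       (an assumed label encoding).
    """
    d = X.shape[1]
    # Posterior precision = prior precision + data precision.
    precision = prior_precision * np.eye(d) + X.T @ X / noise_var
    cov = np.linalg.inv(precision)
    mean = cov @ (X.T @ y) / noise_var
    return mean, cov

def sample_reward(mean, cov, rng):
    """Draw one reward vector from the posterior, as posterior sampling requires."""
    return rng.multivariate_normal(mean, cov)

# Toy example: 3 state-action pairs, one preference.
# The preferred trajectory visits pair 0 once more and pair 2 once less.
X = np.array([[1.0, 0.0, -1.0]])
y = np.array([0.5])
mean, cov = reward_posterior(X, y)
rng = np.random.default_rng(0)
r_sample = sample_reward(mean, cov, rng)
```

After a single preference, the posterior mean already ranks the state-action pair visited more often by the preferred trajectory above the one visited less often, while the unvisited pair stays at the prior mean.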
Abstract
In preference-based reinforcement learning (RL), an agent interacts with the environment while receiving preferences instead of absolute feedback. While there is increasing research activity in preference-based RL, the design of formal frameworks that admit tractable theoretical analysis remains an open challenge. Building upon ideas from preference-based bandit learning and posterior sampling in RL, we present DUELING POSTERIOR SAMPLING (DPS), which employs preference-based posterior sampling to learn both the system dynamics and the underlying utility function that governs the preference feedback. As preference feedback is provided on trajectories rather than individual state-action pairs, we develop a Bayesian approach for the credit assignment problem, translating preferences to a posterior distribution over state-action reward models. We prove an asymptotic Bayesian no-regret rate…
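The dynamics side of posterior sampling described above can be sketched as follows, assuming a tabular MDP with an independent Dirichlet prior over each state-action pair's next-state distribution (a common choice for posterior sampling in RL; the summary on this page does not confirm the prior used). A policy is then computed for the sampled dynamics and a sampled reward by finite-horizon value iteration. The function names and array layouts are hypothetical.

```python
import numpy as np

def sample_dynamics(counts, prior=1.0, rng=None):
    """Sample a transition model from a Dirichlet posterior (assumed prior).

    counts: (S, A, S) array of observed transition counts.
    Returns P with P[s, a] a probability vector over next states.
    """
    rng = np.random.default_rng() if rng is None else rng
    S, A, _ = counts.shape
    P = np.empty((S, A, S))
    for s in range(S):
        for a in range(A):
            P[s, a] = rng.dirichlet(counts[s, a] + prior)
    return P

def greedy_policy(P, r, horizon):
    """Finite-horizon value iteration for sampled dynamics P and rewards r.

    r: (S, A) array of sampled rewards.
    Returns an integer array of shape (horizon, S) with the greedy action
    at each time step and state.
    """
    S, A, _ = P.shape
    V = np.zeros(S)
    policy = np.zeros((horizon, S), dtype=int)
    for t in reversed(range(horizon)):
        Q = r + P @ V          # (S, A, S) @ (S,) -> (S, A)
        policy[t] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return policy

# Toy example: 2 states, 2 actions, no transitions observed yet.
counts = np.zeros((2, 2, 2))
P = sample_dynamics(counts, rng=np.random.default_rng(0))
r = np.array([[0.0, 1.0], [1.0, 0.0]])
policy = greedy_policy(P, r, horizon=3)
```

In a dueling setup, two such samples would be drawn independently per iteration, giving two policies whose trajectories are then compared by the preference feedback.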
Taxonomy
Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Advanced Multi-Objective Optimization Algorithms
Methods: Linear Regression
