Dueling Posterior Sampling for Preference-Based Reinforcement Learning

Ellen R. Novoseller, Yibing Wei, Yanan Sui, Yisong Yue, and Joel W. Burdick

arXiv:1908.01289·cs.LG·June 30, 2020·6 cites

TL;DR

This paper introduces DUELING POSTERIOR SAMPLING (DPS), a novel Bayesian approach for preference-based reinforcement learning that learns system dynamics and utility functions from trajectory preferences, providing the first regret guarantees in this setting.

Contribution

The paper presents DPS, a new Bayesian posterior sampling method for preference-based RL, with theoretical regret guarantees and empirical evaluation demonstrating competitive performance.

Findings

01 Proves an asymptotic Bayesian no-regret rate for DPS.

02 Develops a Bayesian credit assignment approach that maps trajectory-level preferences to state-action rewards.

03 Shows that DPS performs competitively against existing methods.

Abstract

In preference-based reinforcement learning (RL), an agent interacts with the environment while receiving preferences instead of absolute feedback. While there is increasing research activity in preference-based RL, the design of formal frameworks that admit tractable theoretical analysis remains an open challenge. Building upon ideas from preference-based bandit learning and posterior sampling in RL, we present DUELING POSTERIOR SAMPLING (DPS), which employs preference-based posterior sampling to learn both the system dynamics and the underlying utility function that governs the preference feedback. As preference feedback is provided on trajectories rather than individual state-action pairs, we develop a Bayesian approach for the credit assignment problem, translating preferences to a posterior distribution over state-action reward models. We prove an asymptotic Bayesian no-regret rate…
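
To make the credit assignment idea concrete, the following is a minimal sketch of one way trajectory preferences could be turned into a Gaussian posterior over state-action rewards, using Bayesian linear regression on differences of state-action visit counts. It is an illustration under stated assumptions, not the authors' implementation: the function names, the ±0.5 preference labels, and the unit-precision Gaussian prior are all hypothetical choices made for this example.

```python
import numpy as np


def visit_counts(trajectory, n_states, n_actions):
    """Return a flat vector counting each (state, action) pair in a trajectory."""
    counts = np.zeros(n_states * n_actions)
    for state, action in trajectory:
        counts[state * n_actions + action] += 1.0
    return counts


def reward_posterior(pref_data, n_states, n_actions, prior_precision=1.0):
    """Gaussian posterior over per-(state, action) rewards.

    pref_data: list of (traj_a, traj_b, a_preferred) tuples, where each
    trajectory is a list of (state, action) pairs.
    Assumption: the label is +0.5 if traj_a is preferred, else -0.5, and
    the prior is a zero-mean Gaussian with precision prior_precision * I.
    """
    dim = n_states * n_actions
    precision = prior_precision * np.eye(dim)
    weighted_sum = np.zeros(dim)
    for traj_a, traj_b, a_preferred in pref_data:
        # Regress the preference label on the difference in visit counts.
        x = (visit_counts(traj_a, n_states, n_actions)
             - visit_counts(traj_b, n_states, n_actions))
        y = 0.5 if a_preferred else -0.5
        precision += np.outer(x, x)
        weighted_sum += y * x
    cov = np.linalg.inv(precision)
    mean = cov @ weighted_sum
    return mean, cov


def sample_rewards(mean, cov, rng, n_states, n_actions):
    """Draw one reward table from the posterior (the posterior sampling step)."""
    return rng.multivariate_normal(mean, cov).reshape(n_states, n_actions)


# Two states, one action: a trajectory visiting state 0 is preferred
# over a trajectory visiting state 1.
prefs = [([(0, 0)], [(1, 0)], True)]
mean, cov = reward_posterior(prefs, n_states=2, n_actions=1)
sampled = sample_rewards(mean, cov, np.random.default_rng(0), 2, 1)
```

In this toy example the posterior mean assigns a higher reward to the state-action pair seen only in the preferred trajectory. A policy computed from a sampled reward table, together with sampled dynamics, would then be used to generate the next trajectory for comparison.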

Code & Models

Repositories

ernovoseller/DuelingPosteriorSampling (official)

Taxonomy

Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Advanced Multi-Objective Optimization Algorithms

Methods: Linear Regression