Kernelized Offline Contextual Dueling Bandits

Viraj Mehta; Ojash Neopane; Vikramjeet Das; Sen Lin; Jeff Schneider; Willie Neiswanger

arXiv:2307.11288·cs.LG·July 24, 2023

Open Access

TL;DR

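This paper introduces the offline contextual dueling bandit setting, in which the agent chooses the contexts at which to request human preference feedback so that it can identify a good policy efficiently. It gives an upper-confidence-bound style algorithm with a proven regret bound and shows empirically that it outperforms uniform context sampling.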

Contribution

The paper proposes an upper-confidence-bound style algorithm for the offline contextual dueling bandit setting, proves a regret bound for it, and compares it empirically against a similar strategy that samples contexts uniformly.

Findings

01 The algorithm achieves a sublinear regret bound.

02 It outperforms a similar strategy that samples contexts uniformly.

03 Empirical results confirm that it identifies a good policy more efficiently.

Abstract

Preference-based feedback is important for many applications where direct evaluation of a reward function is not feasible. A notable recent example arises in reinforcement learning from human feedback on large language models. For many of these applications, the cost of acquiring the human feedback can be substantial or even prohibitive. In this work, we take advantage of the fact that often the agent can choose contexts at which to obtain human feedback in order to most efficiently identify a good policy, and introduce the offline contextual dueling bandit setting. We give an upper-confidence-bound style algorithm for this setting and prove a regret bound. We also give empirical confirmation that this method outperforms a similar strategy that uses uniformly sampled contexts.

Taxonomy

Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Machine Learning and Algorithms