Vector preference-based contextual bandits under distributional shifts

Apurv Shukla; P.R. Kumar

arXiv:2508.15966·cs.LG·August 25, 2025

TL;DR

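This paper studies contextual bandit learning under distribution shift when vector-valued rewards are ordered by a given preference cone. It proposes a policy that self-tunes to the shift, introduces a regret measure based on the distance between Pareto fronts, and proves regret upper bounds that generalize known results for the no-shift and vector-reward settings.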

Contribution

The paper proposes a policy based on adaptive discretization and optimistic elimination that self-tunes to the underlying distribution shift, and introduces preference-based regret, which measures a policy's performance as the distance between Pareto fronts.

Findings

01 The regret upper bounds generalize known results for the no-shift and vector-reward settings.

02 The regret bounds scale gracefully with problem parameters in the presence of distribution shifts.

03 The policy self-tunes to the underlying distribution shift.

Abstract

We consider contextual bandit learning under distribution shift when reward vectors are ordered according to a given preference cone. We propose a policy based on adaptive discretization and optimistic elimination that self-tunes to the underlying distribution shift. To measure the performance of this policy, we introduce the notion of preference-based regret, which measures the performance of a policy in terms of the distance between Pareto fronts. We study the performance of this policy by establishing upper bounds on its regret under various assumptions on the nature of the distribution shift. Our regret bounds generalize known results for the no-distribution-shift and vector-reward settings, and scale gracefully with problem parameters in the presence of distribution shifts.
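
To make "ordered according to a given preference cone" and "Pareto front" concrete, here is a minimal Python sketch. It is not the paper's algorithm or its regret definition. It assumes a pointed polyhedral cone C = {x : Wx >= 0} and says u is preferred to v when u - v lies in C and u differs from v. The matrix W and the function names in_cone, dominates, and pareto_front are hypothetical, chosen only for illustration.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def in_cone(x: Vector, W: Sequence[Vector], tol: float = 1e-12) -> bool:
    """Return True if W x >= 0 componentwise, i.e. x lies in {x : W x >= 0}."""
    return all(sum(w * xi for w, xi in zip(row, x)) >= -tol for row in W)


def dominates(u: Vector, v: Vector, W: Sequence[Vector]) -> bool:
    """u dominates v if u != v and u - v lies in the cone (assumed pointed)."""
    diff = [a - b for a, b in zip(u, v)]
    return any(d != 0 for d in diff) and in_cone(diff, W)


def pareto_front(points: List[Tuple[float, ...]], W: Sequence[Vector]) -> List[Tuple[float, ...]]:
    """Keep the points that no other point dominates under the cone order."""
    return [p for p in points if not any(dominates(q, p, W) for q in points)]


# Illustrative reward vectors (made up for this example).
points = [(1, 3), (2, 2), (3, 1), (1, 1), (2, 1)]

# W = identity gives the nonnegative orthant, i.e. the usual Pareto order.
orthant = [(1, 0), (0, 1)]
print(pareto_front(points, orthant))  # [(1, 3), (2, 2), (3, 1)]

# A wider cone {x1 >= 0, x1 + x2 >= 0} makes more pairs comparable,
# so the front shrinks.
wider = [(1, 0), (1, 1)]
print(pareto_front(points, wider))  # [(3, 1)]
```

With the identity matrix the cone is the nonnegative orthant and the usual componentwise Pareto front is recovered. Widening the cone makes more pairs of reward vectors comparable, which is why the second front contains a single point.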
