Vector preference-based contextual bandits under distributional shifts
Apurv Shukla, P.R. Kumar

TL;DR
This paper studies contextual bandit learning under distribution shift when reward vectors are ordered by a preference cone. It introduces a preference-based regret measure and proves regret upper bounds that generalize known results for the setting without distribution shift.
Contribution
It proposes a policy based on adaptive discretization and optimistic elimination that self-tunes to the underlying distribution shift, and it introduces preference-based regret, which measures policy performance as a distance between Pareto fronts.
Findings
Regret upper bounds generalize known results for the no-shift and vector-reward settings to the case of distribution shift
Regret bounds scale gracefully with problem parameters in the presence of distribution shift
The policy self-tunes to the underlying distribution shift
Abstract
We consider contextual bandit learning under distribution shift when reward vectors are ordered according to a given preference cone. We propose a policy based on adaptive discretization and optimistic elimination that self-tunes to the underlying distribution shift. To measure the performance of this policy, we introduce the notion of preference-based regret, which measures the performance of a policy in terms of the distance between Pareto fronts. We study the performance of this policy by establishing upper bounds on its regret under various assumptions on the nature of the distribution shift. Our regret bounds generalize known results for the existing case of no distribution shift and vectorial reward settings, and scale gracefully with problem parameters in the presence of distribution shifts.
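The abstract refers to ordering reward vectors by a preference cone and to Pareto fronts without spelling out either. The sketch below illustrates the general idea only and is not the paper's construction. It assumes a polyhedral cone of the form C = {x : Wx >= 0}, and the names `in_cone`, `dominates`, `pareto_front`, `IDENTITY`, `WIDE`, and `POINTS` are hypothetical. A vector u is treated as preferred to v when u - v lies in C, and the Pareto front is the set of vectors that no other vector is preferred to.

```python
from typing import List, Sequence, Tuple

Vector = Tuple[float, ...]
Matrix = Sequence[Sequence[float]]

TOL = 1e-12  # numerical slack for the inequality checks


def in_cone(W: Matrix, x: Sequence[float]) -> bool:
    """Return True if W x >= 0 componentwise, i.e. x is in {x : W x >= 0}."""
    return all(sum(w * xi for w, xi in zip(row, x)) >= -TOL for row in W)


def dominates(W: Matrix, u: Vector, v: Vector) -> bool:
    """Return True if u != v and u - v lies in the cone (assumed pointed)."""
    diff = [a - b for a, b in zip(u, v)]
    return any(abs(d) > TOL for d in diff) and in_cone(W, diff)


def pareto_front(W: Matrix, points: Sequence[Vector]) -> List[Vector]:
    """Keep the points that no other point dominates under the cone order."""
    return [p for p in points if not any(dominates(W, q, p) for q in points)]


# Assumption: two illustrative cones in the plane.
IDENTITY = [[1.0, 0.0], [0.0, 1.0]]  # positive orthant: componentwise order
WIDE = [[2.0, 1.0], [1.0, 2.0]]      # a wider cone containing the orthant

# Assumption: made-up reward vectors, not data from the paper.
POINTS = [(1.0, 3.0), (2.0, 2.5), (3.0, 1.0), (1.0, 1.0)]

if __name__ == "__main__":
    print(pareto_front(IDENTITY, POINTS))
    print(pareto_front(WIDE, POINTS))
```

With the positive orthant this reduces to the usual componentwise Pareto order. Widening the cone makes more pairs comparable, so the front can only shrink: here (1.0, 3.0) drops out under `WIDE` because (2.0, 2.5) is preferred to it.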
