Inverse Contextual Bandits without Rewards: Learning from a Non-Stationary Learner via Suffix Imitation

Yuqi Kong; Xiao Zhang; Weiran Shen

arXiv:2603.03778 · cs.LG · March 5, 2026

Open Access

TL;DR

This paper introduces a two-phase imitation framework for inverse contextual bandits that enables a passive observer to recover optimal policies from non-stationary action data without access to rewards, matching the efficiency of reward-aware learners.

Contribution

The paper proposes Two-Phase Suffix Imitation, a method that learns from non-stationary action data in inverse bandit problems without reward information and comes with theoretical guarantees.

Findings

01 Passive observer achieves $\tilde{O}(1/\sqrt{N})$ convergence rate.

02 Framework handles non-stationary exploration-exploitation data.

03 Performance matches reward-aware learner asymptotically.

Abstract

We study the Inverse Contextual Bandit (ICB) problem, in which a learner seeks to optimize a policy while an observer, who cannot access the learner's rewards and only observes actions, aims to recover the underlying problem parameters. During the learning process, the learner's behavior naturally transitions from exploration to exploitation, resulting in non-stationary action data that poses significant challenges for the observer. To address this issue, we propose a simple and effective framework called Two-Phase Suffix Imitation. The framework discards data from an initial burn-in phase and performs empirical risk minimization using only data from a subsequent imitation phase. We derive a predictive decision loss bound that explicitly characterizes the bias-variance trade-off induced by the choice of burn-in length. Despite the severe information deficit, we show that a reward-free…
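The two-phase idea in the abstract (discard a burn-in prefix, then run empirical risk minimization on the remaining suffix) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `suffix_imitation`, the finite (tabular) context set, the 0-1 imitation loss, and the toy log are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def suffix_imitation(contexts, actions, burn_in):
    """Return a tabular policy fit by 0-1 ERM on rounds burn_in, burn_in+1, ...

    Hypothetical helper: contexts must be hashable, and the policy class is
    assumed to be all deterministic maps from observed contexts to actions.
    """
    if len(contexts) != len(actions):
        raise ValueError("contexts and actions must have equal length")
    if not 0 <= burn_in < len(actions):
        raise ValueError("burn_in must leave at least one imitation round")

    # Phase 1 (burn-in): rounds before burn_in are discarded.
    # Phase 2 (imitation): count the learner's actions per context.
    counts = defaultdict(Counter)
    for x, a in zip(contexts[burn_in:], actions[burn_in:]):
        counts[x][a] += 1

    # Under 0-1 loss, the empirical risk minimizer for each context is
    # the action the learner chose most often in the imitation phase.
    return {x: c.most_common(1)[0][0] for x, c in counts.items()}


# Toy log (made up): 6 exploratory rounds, then 4 rounds after the learner settles.
contexts = ["a", "b"] * 5
actions = [1, 0, 1, 0, 1, 0, 0, 1, 0, 1]

policy_suffix = suffix_imitation(contexts, actions, burn_in=6)
policy_full = suffix_imitation(contexts, actions, burn_in=0)
print(policy_suffix)  # {'a': 0, 'b': 1}
print(policy_full)    # {'a': 1, 'b': 0}
```

In this toy log, imitating the full history reproduces the exploratory actions, while imitating only the suffix recovers the actions the learner settled on. A longer burn-in removes more exploratory rounds but leaves fewer samples for the fit, which is the kind of trade-off the abstract attributes to the choice of burn-in length.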

Taxonomy

Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Machine Learning and Algorithms