Direction-Aware Offline-to-Online Learning in Linear Contextual Bandits
Zean Han, Ruihan Lin, Zezhen Ding, Jiheng Zhang

TL;DR
This paper introduces a directional bias certificate for linear bandits that improves offline-to-online learning by adaptively exploiting historical data based on bias directions, leading to better regret bounds.
Contribution
It proposes a novel directional bias certificate and an algorithm that adaptively leverages offline data, improving regret bounds in linear bandit problems with biased offline data.
Findings
The algorithm matches standard regret rates when the bias certificate is known.
It improves regret when offline data aligns with low-bias directions.
Numerical experiments confirm theoretical advantages in aligned regimes.
Abstract
Many bandit systems are deployed with offline historical data, such as past logs from earlier policies. Using these data can reduce early online exploration when they remain informative for the online problem. When the offline and online environments differ, such data can be biased for the online problem. For linear (contextual) bandits, this bias is directional: offline data may be informative in some feature directions and misleading in others. However, prior work typically controls this gap through a known Euclidean bound on the model parameters, which we prove is too coarse: even with the offline parameter known, bias in a single unknown direction can force dimension-dependent regret. To address this challenge, we introduce a directional bias certificate that measures the offline-to-online gap through an -induced norm and assigns…
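The contrast between a Euclidean bound and a direction-aware measure of the offline-to-online gap can be illustrated with a small numerical sketch. It is not the paper's certificate, whose exact definition is not given in this excerpt. The weight matrices `M_low_bias` and `M_biased`, the helper `weighted_norm`, and all parameter values are hypothetical and chosen only for illustration.

```python
import numpy as np


def weighted_norm(v, M):
    """Return sqrt(v^T M v) for a symmetric positive semidefinite matrix M."""
    return float(np.sqrt(v @ M @ v))


d = 4
theta_online = np.array([1.0, 0.5, -0.5, 0.2])

# Hypothetical offline-to-online gap: bias only along the first feature direction.
gap = np.zeros(d)
gap[0] = 0.3
theta_offline = theta_online + gap

# A Euclidean bound sees only the overall size of the gap, not its direction.
euclid = float(np.linalg.norm(theta_offline - theta_online))

# Hypothetical weight matrices (assumptions, not taken from the paper):
# one weights only the unbiased directions, the other only the biased one.
M_low_bias = np.diag([0.0, 1.0, 1.0, 1.0])
M_biased = np.diag([1.0, 0.0, 0.0, 0.0])

low = weighted_norm(gap, M_low_bias)   # gap measured along directions 2..d
high = weighted_norm(gap, M_biased)    # gap measured along direction 1

print(f"Euclidean gap: {euclid:.3f}")
print(f"Gap in low-bias directions: {low:.3f}")
print(f"Gap in biased direction: {high:.3f}")
```

In this toy setting the Euclidean gap is nonzero even though the offline parameter is exact in three of the four directions, which is the kind of coarseness the abstract attributes to Euclidean bounds.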
