Imitation-Regularized Offline Learning
Yifei Ma, Yu-Xiang Wang, Balakrishnan (Murali) Narayanaswamy

TL;DR
This paper introduces an offline learning approach that combines policy improvement with imitation regularization, targeting policy evaluation and optimization when logged action probabilities are missing or so small that inverse probability weighted estimates become highly uncertain.
Contribution
It proposes PIL-IML as an extension to Clipped-IPWE, providing lower-bound surrogates and connecting imitation regularization to variance estimation and natural policy gradients.
Findings
PIL-IML improves policy evaluation accuracy.
Regularization reduces variance in off-policy estimates.
The method performs well on simulated and real datasets.
Abstract
We study the problem of offline learning in automated decision systems under the contextual bandits model. We are given logged historical data consisting of contexts, (randomized) actions, and (nonnegative) rewards. A common goal is to evaluate what would happen if different actions were taken in the same contexts, so as to optimize the action policies accordingly. The typical approach to this problem, inverse probability weighted estimation (IPWE) [Bottou et al., 2013], requires logged action probabilities, which may be missing in practice due to engineering complications. Even when available, small action probabilities cause large uncertainty in IPWE, rendering the corresponding results insignificant. To solve both problems, we show how one can use policy improvement (PIL) objectives, regularized by policy imitation (IML). We motivate and analyze PIL as an extension to Clipped-IPWE,…
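To make the variance issue described in the abstract concrete, the following is a minimal sketch of IPWE and a clipped variant on toy logged data. It is not the paper's implementation: the function names, the clipping threshold `clip`, and the toy arrays are all assumptions chosen for illustration. The clipped version caps each importance weight, which (because rewards are nonnegative) can only lower the estimate relative to plain IPWE.

```python
import numpy as np

def ipwe(rewards, target_probs, logged_probs):
    """Inverse probability weighted estimate of a target policy's value.

    Each logged reward is reweighted by target_prob / logged_prob for the
    action that was actually taken.
    """
    weights = target_probs / logged_probs
    return float(np.mean(weights * rewards))

def clipped_ipwe(rewards, target_probs, logged_probs, clip=10.0):
    """Same estimate, but with importance weights capped at `clip`.

    `clip` is a hypothetical threshold for this sketch. With nonnegative
    rewards, capping weights never increases the estimate.
    """
    weights = np.minimum(target_probs / logged_probs, clip)
    return float(np.mean(weights * rewards))

# Toy logged data (illustrative values only): the third sample has a very
# small logged probability, so its importance weight dominates plain IPWE.
rewards = np.array([1.0, 0.0, 2.0, 1.0])
logged_probs = np.array([0.5, 0.5, 0.01, 0.9])
target_probs = np.array([0.5, 0.2, 0.9, 0.1])

plain = ipwe(rewards, target_probs, logged_probs)
capped = clipped_ipwe(rewards, target_probs, logged_probs, clip=10.0)
print(plain, capped)
```

In this toy example a single sample with logged probability 0.01 contributes almost all of the plain estimate, which is the kind of instability the abstract attributes to small action probabilities. Note that both functions still require `logged_probs`, so this sketch does not address the missing-probabilities case.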
Taxonomy
Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Machine Learning and Algorithms
