Boosted Off-Policy Learning
Ben London, Levi Lu, Ted Sandler, Thorsten Joachims

TL;DR
This paper introduces a boosting algorithm for off-policy learning from logged bandit feedback. The algorithm directly optimizes an estimate of the policy's expected reward and is supported by both a convergence analysis and experiments.
Contribution
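It presents the first boosting method for off-policy learning, proves that the excess empirical risk decreases (possibly exponentially fast) with each boosting round when the base learner satisfies a "weak" learning condition, and shows how to reduce the base learner to supervised learning.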
Findings
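The algorithm can outperform off-policy learning with deep neural networks in the reported experiments.
It inherits the robustness of tree-based boosting to feature scaling and hyperparameter tuning.
It works with readily available supervised base learners, such as decision trees.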
Abstract
We propose the first boosting algorithm for off-policy learning from logged bandit feedback. Unlike existing boosting methods for supervised learning, our algorithm directly optimizes an estimate of the policy's expected reward. We analyze this algorithm and prove that the excess empirical risk decreases (possibly exponentially fast) with each round of boosting, provided a "weak" learning condition is satisfied by the base learner. We further show how to reduce the base learner to supervised learning, which opens up a broad range of readily available base learners with practical benefits, such as decision trees. Experiments indicate that our algorithm inherits many desirable properties of tree-based boosting algorithms (e.g., robustness to feature scaling and hyperparameter tuning), and that it can outperform off-policy learning with deep neural networks as well as methods that simply…
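To make "an estimate of the policy's expected reward" concrete, the sketch below computes one standard estimate from logged bandit feedback: inverse propensity scoring (IPS). It is an illustration under stated assumptions, not the paper's algorithm. The abstract does not say which estimator the authors use, and the names `ips_value`, `softmax_policy`, and the toy `logs` are hypothetical.

```python
import math


def ips_value(logs, policy):
    """IPS estimate of a target policy's expected reward.

    logs:   list of (context, action, reward, propensity) tuples, where
            propensity is the logging policy's probability of the logged action.
    policy: function (context, action) -> probability under the target policy.
    """
    total = 0.0
    for context, action, reward, propensity in logs:
        # Reweight each logged reward by how much more (or less) likely
        # the target policy is to take the logged action.
        total += reward * policy(context, action) / propensity
    return total / len(logs)


def softmax_policy(scores):
    """Turn a (hypothetical) score function into a stochastic policy.

    scores: function context -> list of per-action scores, e.g. the output
            of an ensemble of regression trees.
    """
    def policy(context, action):
        s = scores(context)
        m = max(s)  # subtract the max for numerical stability
        exps = [math.exp(v - m) for v in s]
        return exps[action] / sum(exps)
    return policy


# Toy logs: two actions, logged by a uniform-random policy (propensity 0.5).
# Action 0 is rewarded when the context is 0.0, action 1 when it is 1.0.
logs = [
    (0.0, 0, 1.0, 0.5),
    (0.0, 1, 0.0, 0.5),
    (1.0, 0, 0.0, 0.5),
    (1.0, 1, 1.0, 0.5),
]

uniform = lambda context, action: 0.5
# Near-deterministic policy that prefers the rewarded action in each context.
greedy = softmax_policy(lambda x: [10.0, -10.0] if x < 0.5 else [-10.0, 10.0])

uniform_value = ips_value(logs, uniform)  # 0.5
greedy_value = ips_value(logs, greedy)    # just under 1.0
```

A boosting method in this setting would add base learners to the score function round by round so that an estimate of this kind improves. The specific objective, update rule, and weak learning condition are defined in the paper itself.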
Taxonomy
Topics: Advanced Bandit Algorithms Research · Smart Grid Energy Management · Reinforcement Learning in Robotics
Methods: Balanced Selection
