Differentiable Bandit Exploration

Craig Boutilier, Chih-Wei Hsu, Branislav Kveton, Martin Mladenov, Csaba Szepesvari, and Manzil Zaheer

arXiv:2002.06772 · cs.LG · June 11, 2020 · 5 cites

TL;DR

This paper introduces a meta-learning approach to exploration in Bayesian bandits: policies are parameterized in a differentiable way and optimized by policy gradients, with regret guarantees for a softmax policy and experiments across several policy classes.

Contribution

The paper proposes a differentiable, gradient-based method for learning exploration policies in Bayesian bandits, together with novel variance reduction techniques for its gradient estimators and a softmax policy with regret guarantees.

Findings

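01. Neural network policies can learn implicit biases that are expressed only through the sampled problem instances.

02. The softmax policy has theoretical regret guarantees and is a natural starting point for the optimization.

03. The experiments show that the approach is versatile across different bandit policy classes.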

Abstract

Exploration policies in Bayesian bandits maximize the average reward over problem instances drawn from some distribution $P$. In this work, we learn such policies for an unknown distribution $P$ using samples from $P$. Our approach is a form of meta-learning and exploits properties of $P$ without making strong assumptions about its form. To do this, we parameterize our policies in a differentiable way and optimize them by policy gradients, an approach that is general and easy to implement. We derive effective gradient estimators and introduce novel variance reduction techniques. We also analyze and experiment with various bandit policy classes, including neural networks and a novel softmax policy. The latter has regret guarantees and is a natural starting point for our optimization. Our experiments show the versatility of our approach. We also…
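
The idea in the abstract (sample instances from $P$, run a differentiable policy, and follow the policy gradient of the average reward) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the paper's implementation: the one-parameter softmax over empirical means is a stand-in rather than the paper's softmax policy, $P$ is taken to be uniform over Bernoulli arm means, the names `run_episode`, `policy_gradient`, and `train` are hypothetical, and the leave-one-out baseline is a generic variance reduction device rather than one of the paper's estimators.

```python
import numpy as np

def run_episode(theta, mu, horizon, rng):
    """Play one Bernoulli bandit with arm means mu.

    Returns (total reward, d/dtheta of sum_t log pi_theta(a_t | history)).
    """
    k = len(mu)
    pulls = np.ones(k)       # one pseudo-pull per arm (assumed prior)
    wins = 0.5 * np.ones(k)  # so every initial estimate is 0.5
    total, score = 0.0, 0.0
    for _ in range(horizon):
        est = wins / pulls
        logits = theta * est             # illustrative softmax policy
        p = np.exp(logits - logits.max())
        p /= p.sum()
        a = rng.choice(k, p=p)
        # Score function of softmax(theta * est) with respect to theta.
        score += est[a] - p @ est
        r = float(rng.random() < mu[a])  # Bernoulli reward
        wins[a] += r
        pulls[a] += 1
        total += r
    return total, score

def policy_gradient(theta, horizon, batch, rng, k=2):
    """Score-function gradient estimate with a leave-one-out baseline."""
    returns = np.empty(batch)
    scores = np.empty(batch)
    for j in range(batch):
        mu = rng.random(k)  # instance drawn from P (assumed uniform here)
        returns[j], scores[j] = run_episode(theta, mu, horizon, rng)
    # Baseline for episode j is the mean return of the other episodes,
    # so it is independent of episode j and keeps the estimate unbiased.
    baseline = (returns.sum() - returns) / (batch - 1)
    grad = np.mean(scores * (returns - baseline))
    return grad, returns.mean()

def train(theta=1.0, horizon=50, batch=200, steps=30, lr=0.05, seed=0):
    """Plain gradient ascent on the average reward over sampled instances."""
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        grad, _ = policy_gradient(theta, horizon, batch, rng)
        theta += lr * grad
    return theta
```

Calling `train()` returns the tuned value of `theta`; the step size, batch size, and horizon are arbitrary choices for the sketch. The baseline matters because the raw estimator multiplies each score by the full episode return, which is large and noisy even when the policy parameter has little effect.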

Taxonomy

Topics: Advanced Bandit Algorithms Research · Machine Learning and Algorithms · Reinforcement Learning in Robotics

Methods: Softmax