Regularized OFU: an Efficient UCB Estimator for Non-linear Contextual Bandit

Yichi Zhou; Shihong Song; Huishuai Zhang; Jun Zhu; Wei Chen; Tie-Yan Liu

arXiv:2106.15128·cs.LG·June 30, 2021

Open Access

TL;DR

This paper introduces ROFU, a novel OFU-based algorithm that efficiently balances exploration and exploitation in non-linear contextual bandits, including deep neural network settings, with theoretical guarantees and strong empirical performance.

Contribution

The paper proposes ROFU, an efficient and theoretically justified OFU algorithm for non-linear contextual bandits that extends to deep neural networks with gradient-based optimization.

Findings

01. ROFU achieves near-optimal regret bounds for multi-armed, kernel contextual, and neural tangent kernel bandits.

02. ROFU is computationally efficient and extends to deep neural networks through gradient-based optimization.

03. Empirical results show ROFU performs well across different contextual bandit settings.

Abstract

Balancing exploration and exploitation (EE) is a fundamental problem in contextual bandit. One powerful principle for EE trade-off is Optimism in Face of Uncertainty (OFU), in which the agent takes the action according to an upper confidence bound (UCB) of reward. OFU has achieved (near-)optimal regret bound for linear/kernel contextual bandits. However, it is in general unknown how to derive efficient and effective EE trade-off methods for non-linear complex tasks, such as contextual bandit with deep neural network as the reward function. In this paper, we propose a novel OFU algorithm named regularized OFU (ROFU). In ROFU, we measure the uncertainty of the reward by a differentiable function and compute the upper confidence bound by solving a regularized optimization problem. We prove that, for multi-armed bandit, kernel contextual bandit and neural tangent kernel bandit, ROFU achieves…
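To make the idea of "computing an upper confidence bound by solving a regularized optimization problem" concrete, here is a minimal sketch for a single arm of a multi-armed bandit. The objective used here, maximizing mu - eta * (L(mu) - L(mean)) with squared loss L, is an illustrative assumption and is not taken from the paper; the names `regularized_ucb`, `closed_form_ucb`, and `eta` are hypothetical. For this particular objective the optimum has the closed form mean + 1/(4 * eta * n), which the gradient-ascent loop should reproduce.

```python
def closed_form_ucb(rewards, eta):
    """Closed-form optimum of the assumed objective: mean + 1 / (4 * eta * n)."""
    n = len(rewards)
    return sum(rewards) / n + 1.0 / (4.0 * eta * n)


def regularized_ucb(rewards, eta, steps=200):
    """Gradient-ascent sketch of an optimistic reward estimate for one arm.

    Maximizes  mu - eta * (L(mu) - L(mean))  with  L(mu) = sum_i (mu - r_i)^2.
    This objective is an assumption made for illustration, not the paper's.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    base_loss = sum((mean - r) ** 2 for r in rewards)

    # The objective is concave with curvature 2 * eta * n; this step size
    # halves the distance to the optimum on every iteration.
    lr = 1.0 / (4.0 * eta * n)

    mu = mean  # start from the least-squares estimate
    for _ in range(steps):
        grad = 1.0 - 2.0 * eta * sum(mu - r for r in rewards)
        mu += lr * grad

    loss = sum((mu - r) ** 2 for r in rewards)
    return mu - eta * (loss - base_loss)


if __name__ == "__main__":
    few = [1.0, 0.0, 1.0, 1.0]
    many = few * 10  # same empirical mean, ten times more observations
    print(regularized_ucb(few, eta=0.5), closed_form_ucb(few, eta=0.5))
    print(regularized_ucb(many, eta=0.5), closed_form_ucb(many, eta=0.5))
```

In this sketch the optimism bonus shrinks as the number of observations grows, so rarely pulled arms receive larger optimistic estimates. Because the estimate is obtained by gradient steps on a differentiable objective, the same pattern could in principle be applied to a neural network reward model, which is the setting the abstract targets.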

Taxonomy

Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Machine Learning and Algorithms