Anti-Concentrated Confidence Bonuses for Scalable Exploration

Jordan T. Ash; Cyril Zhang; Surbhi Goel; Akshay Krishnamurthy; Sham Kakade

arXiv:2110.11202 · cs.LG · April 13, 2022

TL;DR

This paper introduces anti-concentrated confidence bonuses for scalable exploration in high-dimensional reinforcement learning, providing efficient approximations of elliptical bonuses and demonstrating competitive performance on Atari benchmarks.

Contribution

The paper proposes a method for approximating elliptical bonuses with an ensemble of regressors trained to predict random noise, enabling scalable exploration in high-dimensional settings.

Findings

01

Achieves $\tilde{O}(d\sqrt{T})$ regret bounds for linear bandits.

02

Develops a practical deep RL variant competitive with existing heuristics.

03

Demonstrates effectiveness on Atari benchmarks.

Abstract

Intrinsic rewards play a central role in handling the exploration-exploitation trade-off when designing sequential decision-making algorithms, in both foundational theory and state-of-the-art deep reinforcement learning. The LinUCB algorithm, a centerpiece of the stochastic linear bandits literature, prescribes an elliptical bonus which addresses the challenge of leveraging shared information in large action spaces. This bonus scheme cannot be directly transferred to high-dimensional exploration problems, however, due to the computational cost of maintaining the inverse covariance matrix of action features. We introduce anti-concentrated confidence bounds for efficiently approximating the elliptical bonus, using an ensemble of regressors trained to predict random noise from policy network-derived features. Using this approximation, we obtain stochastic linear bandit algorithms…
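To make the idea in the abstract concrete, here is a minimal NumPy sketch, not the authors' implementation. It compares the exact elliptical bonus $\sqrt{x^\top (X^\top X + \lambda I)^{-1} x}$ with an estimate built from an ensemble of ridge regressors fit to pure Gaussian noise. The function names, the random-prior regularizer, and the root-mean-square aggregation over ensemble members are assumptions made for this illustration. The closed-form solve appears only so that the two quantities can be compared; the point of the approach is that such regressors could instead be trained incrementally, without maintaining an inverse covariance matrix.

```python
import numpy as np


def elliptical_bonus(X, x, lam=1.0):
    """Exact bonus sqrt(x^T (X^T X + lam * I)^{-1} x) for past features X (n x d)."""
    d = X.shape[1]
    cov = X.T @ X + lam * np.eye(d)
    return float(np.sqrt(x @ np.linalg.solve(cov, x)))


def ensemble_bonus(X, x, lam=1.0, n_members=2000, seed=0):
    """Estimate the same bonus from regressors trained to predict random noise.

    Hypothetical helper for illustration. Each member j solves
        argmin_w ||X w - y_j||^2 + lam * ||w - w0_j||^2
    with y_j ~ N(0, I) and w0_j ~ N(0, I / lam). Its prediction x^T w_j is then
    Gaussian with variance x^T (X^T X + lam * I)^{-1} x.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    cov = X.T @ X + lam * np.eye(d)
    Y = rng.standard_normal((n, n_members))                     # noise targets
    W0 = rng.standard_normal((d, n_members)) / np.sqrt(lam)     # random priors
    # Closed-form ridge solutions for all members at once (one column each).
    W = np.linalg.solve(cov, X.T @ Y + lam * W0)
    preds = x @ W                                               # one prediction per member
    # Assumed aggregation: root-mean-square of the members' predictions.
    return float(np.sqrt(np.mean(preds ** 2)))


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.standard_normal((50, 5))   # 50 previously played action features
    x = rng.standard_normal(5)         # candidate action
    print("exact bonus:   ", elliptical_bonus(X, x))
    print("ensemble bonus:", ensemble_bonus(X, x))
```

With a few thousand members the two numbers should agree to within a few percent; a smaller ensemble gives a noisier but cheaper estimate.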

Videos

Anti-Concentrated Confidence Bonuses for Scalable Exploration · SlidesLive

Taxonomy

Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Age of Information Optimization