Nearly optimal exploration-exploitation decision thresholds
Christos Dimitrakakis

TL;DR
This paper derives near-optimal decision thresholds for exploration and exploitation in reinforcement learning, linking planning horizon and uncertainty, and introduces a bagging approach for efficient posterior sampling.
Contribution
It presents explicit upper bounds for action utility in multi-armed bandits, generalizes Thompson sampling, and introduces bagging via online bootstrapping for reinforcement learning.
Findings
The proposed decision thresholds improve the balance between exploration and exploitation.
Experimental results show performance competitive with existing algorithms.
An efficient online bootstrapping method is introduced for approximate posterior sampling.
Abstract
While trading off exploration and exploitation in reinforcement learning is hard in general, relatively simple solutions exist under some formulations. In this paper, we first derive upper bounds for the utility of selecting different actions in the multi-armed bandit setting. Unlike the common statistical upper confidence bounds, these explicitly link the planning horizon, uncertainty and the need for exploration. The resulting algorithm can be seen as a generalisation of the classical Thompson sampling algorithm. We experimentally test these algorithms, as well as ε-greedy and the value of perfect information heuristics. Finally, we also introduce the idea of bagging for reinforcement learning. By employing a version of online bootstrapping, we can efficiently sample from an approximate posterior distribution.
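As a rough illustration of two ideas named in the abstract, the sketch below shows classical Thompson sampling on Bernoulli arms and a bootstrap-based variant that replaces the exact posterior with a small set of online bootstrap replicates. It is not the paper's algorithm and does not implement its decision thresholds. The function names, the Beta(1, 1) prior, the pseudo-observations, the number of replicates, and the rule of including each observation in a replicate with probability 1/2 are all assumptions made for this example.

```python
import random


def thompson_bernoulli(true_means, horizon, seed=0):
    """Classical Thompson sampling with Beta(1, 1) priors on Bernoulli arms."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    successes = [0] * n_arms
    failures = [0] * n_arms
    total_reward = 0
    for _ in range(horizon):
        # Draw one plausible mean per arm from its Beta posterior.
        samples = [
            rng.betavariate(successes[a] + 1, failures[a] + 1)
            for a in range(n_arms)
        ]
        # Act greedily with respect to the sampled means.
        arm = max(range(n_arms), key=samples.__getitem__)
        reward = 1 if rng.random() < true_means[arm] else 0
        successes[arm] += reward
        failures[arm] += 1 - reward
        total_reward += reward
    return total_reward, successes, failures


def bootstrap_bandit(true_means, horizon, n_replicates=10, seed=0):
    """Thompson-style sampling from online bootstrap replicates.

    Assumption: each replicate keeps a running sum and count per arm, and
    each new observation is added to a replicate with probability 1/2.
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    # Assumption: every replicate starts with two pseudo-observations
    # (one reward of 0, one reward of 1) so early estimates are defined.
    sums = [[1.0] * n_replicates for _ in range(n_arms)]
    counts = [[2.0] * n_replicates for _ in range(n_arms)]
    pulls = [0] * n_arms
    total_reward = 0
    for _ in range(horizon):
        # For each arm, pick a random replicate as an approximate posterior draw.
        estimates = []
        for a in range(n_arms):
            b = rng.randrange(n_replicates)
            estimates.append(sums[a][b] / counts[a][b])
        arm = max(range(n_arms), key=estimates.__getitem__)
        reward = 1 if rng.random() < true_means[arm] else 0
        # Online bootstrap update: each replicate sees the observation
        # independently with probability 1/2.
        for b in range(n_replicates):
            if rng.random() < 0.5:
                sums[arm][b] += reward
                counts[arm][b] += 1.0
        pulls[arm] += 1
        total_reward += reward
    return total_reward, pulls


if __name__ == "__main__":
    means = [0.2, 0.8]  # hypothetical arm success probabilities
    print(thompson_bernoulli(means, horizon=2000))
    print(bootstrap_bandit(means, horizon=2000))
```

The bootstrap variant never needs a closed-form posterior: each replicate is only a running sum and count, so the same pattern could in principle wrap any incrementally updated estimator. That is the appeal of bagging in settings where exact posterior sampling is expensive.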
Taxonomy
Topics: Advanced Bandit Algorithms Research · Machine Learning and Algorithms · Optimization and Search Problems
