How Log-Barrier Helps Exploration in Policy Optimization
Leonardo Cesani, Matteo Papini, Marcello Restelli

TL;DR
This paper adds a log-barrier regularizer to the Stochastic Gradient Bandit (SGB) algorithm to enforce exploration, proves convergence guarantees for the resulting method, and connects the regularizer to Natural Policy Gradient.
Contribution
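The paper proposes LB-SGB, which regularizes the SGB objective with a log-barrier on the parametric policy. This structurally enforces a minimal amount of exploration, removing the assumption that the probability of the optimal action stays bounded away from zero, and relates the method to Natural Policy Gradient.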
Findings
LB-SGB matches SGB's sample complexity
LB-SGB also converges, at a slower rate, without any assumptions on the learning process
Numerical simulations confirm benefits of log-barrier regularization
Abstract
Recently, it has been shown that the Stochastic Gradient Bandit (SGB) algorithm converges to a globally optimal policy with a constant learning rate. However, these guarantees rely on unrealistic assumptions about the learning process, namely that the probability of the optimal action is always bounded away from zero. We attribute this to the lack of an explicit exploration mechanism in SGB. To address these limitations, we propose to regularize the SGB objective with a log-barrier on the parametric policy, structurally enforcing a minimal amount of exploration. We prove that Log-Barrier Stochastic Gradient Bandit (LB-SGB) matches the sample complexity of SGB, but also converges (at a slower rate) without any assumptions on the learning process. We also show a connection between the log-barrier regularization and Natural Policy Gradient, as both exploit the geometry of the policy space…
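The idea described in the abstract can be illustrated with a minimal sketch, assuming a softmax policy over K arms and a regularized objective of the form (expected reward) + barrier_weight * sum over arms of log pi(a). The exact form and scaling of the regularizer used in the paper are not given here, so this objective, the function names (softmax, lb_sgb_step, run), and all hyperparameter values are illustrative assumptions rather than the authors' implementation. For a softmax policy, the gradient of sum_a log pi(a) with respect to the logits is 1 - K * pi, which pushes up the logits of arms whose probability falls below 1/K.

```python
import numpy as np


def softmax(theta):
    """Numerically stable softmax over a 1-D array of logits."""
    z = theta - np.max(theta)
    e = np.exp(z)
    return e / e.sum()


def lb_sgb_step(theta, arm, reward, lr, barrier_weight):
    """One stochastic gradient ascent step on an ASSUMED objective:
    E[reward] + barrier_weight * sum_a log pi(a).
    """
    pi = softmax(theta)
    k = theta.shape[0]
    one_hot = np.zeros(k)
    one_hot[arm] = 1.0
    # Unbiased estimate of the softmax policy gradient from one sample.
    reward_grad = reward * (one_hot - pi)
    # Exact gradient of sum_a log pi(a) w.r.t. the logits (softmax case).
    barrier_grad = barrier_weight * (1.0 - k * pi)
    return theta + lr * (reward_grad + barrier_grad)


def run(mean_rewards, steps=5000, lr=0.1, barrier_weight=0.01, seed=0):
    """Simulate a Gaussian-noise bandit (hypothetical test environment)."""
    rng = np.random.default_rng(seed)
    mean_rewards = np.asarray(mean_rewards, dtype=float)
    theta = np.zeros(len(mean_rewards))
    for _ in range(steps):
        pi = softmax(theta)
        arm = rng.choice(len(mean_rewards), p=pi)
        reward = mean_rewards[arm] + 0.1 * rng.standard_normal()
        theta = lb_sgb_step(theta, arm, reward, lr, barrier_weight)
    return softmax(theta)


if __name__ == "__main__":
    print(run([0.2, 0.8]))
```

Setting barrier_weight to 0 recovers a plain stochastic gradient bandit update in this sketch. With a positive weight, the barrier term vanishes at the uniform policy and grows as any arm's probability shrinks, which is one way to read the "minimal amount of exploration" mentioned in the abstract.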
