How Log-Barrier Helps Exploration in Policy Optimization

Leonardo Cesani; Matteo Papini; Marcello Restelli

arXiv:2603.15001·cs.LG·May 11, 2026

TL;DR

This paper adds a log-barrier regularizer to the stochastic gradient bandit objective to enforce a minimal amount of exploration, proves convergence guarantees for the resulting algorithm, and connects the regularizer to Natural Policy Gradient.

Contribution

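It proposes the Log-Barrier Stochastic Gradient Bandit (LB-SGB) algorithm, which regularizes the Stochastic Gradient Bandit (SGB) objective with a log-barrier on the parametric policy. The regularizer enforces exploration structurally, so convergence no longer requires assuming that the probability of the optimal action stays bounded away from zero. The paper also links the log-barrier to Natural Policy Gradient.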

Findings

01

LB-SGB matches SGB's sample complexity

02

LB-SGB also converges, at a slower rate, without any assumptions on the learning process

03

Numerical simulations confirm the benefits of log-barrier regularization

Abstract

Recently, it has been shown that the Stochastic Gradient Bandit (SGB) algorithm converges to a globally optimal policy with a constant learning rate. However, these guarantees rely on unrealistic assumptions about the learning process, namely that the probability of the optimal action is always bounded away from zero. We attribute this to the lack of an explicit exploration mechanism in SGB. To address these limitations, we propose to regularize the SGB objective with a log-barrier on the parametric policy, structurally enforcing a minimal amount of exploration. We prove that Log-Barrier Stochastic Gradient Bandit (LB-SGB) matches the sample complexity of SGB, but also converges (at a slower rate) without any assumptions on the learning process. We also show a connection between the log-barrier regularization and Natural Policy Gradient, as both exploit the geometry of the policy space…
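The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on assumptions that the abstract does not state: a softmax policy over K arms, a regularizer of the form (barrier_weight / K) * sum_a log pi(a), a score-function estimate of the reward gradient without a baseline, and Gaussian reward noise. The names `lb_sgb_step`, `run_bandit`, `lr`, and `barrier_weight` are hypothetical. Under the softmax assumption, the gradient of the regularizer with respect to the parameter of arm j is barrier_weight * (1/K - pi(j)), which pulls the policy toward uniform and keeps every arm's probability away from zero.

```python
import math
import random


def softmax(theta):
    """Numerically stable softmax over a list of floats."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]


def lb_sgb_step(theta, action, reward, lr, barrier_weight):
    """One gradient-ascent step on an assumed log-barrier-regularized objective.

    Assumed objective (not taken from the paper):
        E_{a ~ pi}[r(a)] + (barrier_weight / K) * sum_a log pi(a)
    """
    pi = softmax(theta)
    k = len(theta)
    new_theta = []
    for j in range(k):
        indicator = 1.0 if j == action else 0.0
        # Score-function estimate of the expected-reward gradient.
        reward_grad = reward * (indicator - pi[j])
        # Exact gradient of the assumed log-barrier term for a softmax policy.
        barrier_grad = barrier_weight * (1.0 / k - pi[j])
        new_theta.append(theta[j] + lr * (reward_grad + barrier_grad))
    return new_theta


def run_bandit(mean_rewards, steps, lr, barrier_weight, seed=0):
    """Simulate a bandit with noisy rewards (hypothetical setup) and return the final policy."""
    rng = random.Random(seed)
    k = len(mean_rewards)
    theta = [0.0] * k
    for _ in range(steps):
        pi = softmax(theta)
        action = rng.choices(range(k), weights=pi)[0]
        reward = mean_rewards[action] + rng.gauss(0.0, 0.1)
        theta = lb_sgb_step(theta, action, reward, lr, barrier_weight)
    return softmax(theta)
```

Setting `barrier_weight=0.0` in this sketch recovers a plain stochastic gradient bandit update, so the two can be compared on the same simulated arms, for example with `run_bandit([0.2, 0.5, 0.9], steps=2000, lr=0.1, barrier_weight=0.05)`.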
