Global Optimality for Constrained Exploration via Penalty Regularization
Florian Wolf, Ilyas Fatkhullin, Niao He

TL;DR
This paper introduces PGP, a policy gradient method with penalty regularization for constrained exploration in reinforcement learning. The method has global convergence guarantees and scales to complex tasks.
Contribution
The paper proposes a novel single-loop policy gradient approach with quadratic-penalty regularization that guarantees global convergence for constrained exploration problems.
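As a rough illustration of what a quadratic-penalty regularized exploration objective can look like, the sketch below scores a tabular state-action occupancy measure by its entropy minus a squared constraint violation. This is not the paper's algorithm. The linear constraint form, the function names, the cost table, and the penalty weight `beta` are all assumptions made for illustration.

```python
import numpy as np

def occupancy_entropy(occ, eps=1e-12):
    """Shannon entropy of a state-action occupancy measure (a probability table)."""
    occ = np.asarray(occ, dtype=float)
    # eps guards against log(0) for unvisited state-action pairs.
    return float(-np.sum(occ * np.log(occ + eps)))

def penalized_objective(occ, cost, budget, beta):
    """Entropy minus a quadratic penalty on constraint violation.

    Assumed (hypothetical) constraint form: sum(cost * occ) <= budget.
    """
    occ = np.asarray(occ, dtype=float)
    violation = max(0.0, float(np.sum(cost * occ)) - budget)
    return occupancy_entropy(occ) - 0.5 * beta * violation ** 2

# Toy example: 2 states x 2 actions with a uniform occupancy measure.
uniform = np.full((2, 2), 0.25)
cost = np.array([[1.0, 0.0], [0.0, 0.0]])  # hypothetical per-pair cost

# Expected cost is 0.25, so the first budget is satisfied and the second is not.
feasible = penalized_objective(uniform, cost, budget=0.5, beta=10.0)
infeasible = penalized_objective(uniform, cost, budget=0.0, beta=10.0)
```

With the constraint satisfied, the objective reduces to the plain entropy. Once the budget is exceeded, the penalty grows with the square of the violation, so a single unconstrained objective can be optimized without an inner loop over dual variables.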
Findings
PGP achieves near-optimal constrained entropy values.
The method scales to continuous-control tasks.
PGP is validated empirically on grid-world and continuous-control tasks.
Abstract
Efficient exploration is a central problem in reinforcement learning and is often formalized as maximizing the entropy of the state-action occupancy measure. While unconstrained maximum-entropy exploration is relatively well understood, real-world exploration is often constrained by safety, resource, or imitation requirements. This constrained setting is particularly challenging because entropy maximization lacks additive structure, rendering Bellman-equation-based methods inapplicable. Moreover, scalable approaches require policy parameterization, inducing non-convexity in both the objective and the constraints. To our knowledge, the only prior model-free policy-gradient approach for this setting under general policy parameterization is due to Ying et al. (2025). Unfortunately, their guarantees are limited to weak regret and ergodic averages, which do not imply that the final output is…
