Delightful Gradients Accelerate Corner Escape
Jincheng Mei, Ian Osband

TL;DR
Delightful Policy Gradient (DG) improves convergence by escaping sub-optimal corners faster than standard methods, with proven guarantees in bandits and MDPs, and promising empirical results on MNIST contextual bandits.
Contribution
The paper introduces DG, a policy gradient method that mitigates corner trapping. It provides convergence guarantees for bandits and MDPs and reports faster recovery than standard policy gradient on MNIST contextual bandits.
Findings
In its zero-temperature limit, DG removes the corner-trapping mechanism near sub-optimal corners of the simplex.
DG converges globally to the optimal policy at an O(1/t) rate.
In MNIST bandits, DG outperforms standard policy gradient in recovery speed.
Abstract
Softmax policy gradient converges at O(1/t), but its transient behavior near sub-optimal corners of the simplex can be exponentially slow. The bottleneck is self-trapping: negative-advantage actions reinforce the corner policy and can initially push the optimal action backward. We study Delightful Policy Gradient (DG), which gates each policy-gradient term by the product of advantage and action surprisal. For K-armed bandits, we prove that the zero-temperature limit of DG removes this corner-trapping mechanism on a quantitative sector near any sub-optimal corner, yielding a first-exit escape bound logarithmic in the initial probability ratio. At every fixed temperature, the same local mechanism persists because harmful actions are polynomially suppressed as they become rare. A key structural insight is that every action better than the corner action is an ally: its…
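The gating idea in the abstract can be illustrated with a small sketch for a K-armed bandit with a softmax policy. This is not the authors' implementation: the sigmoid form of the gate, the expected-reward baseline, and the names `expected_update`, `log_ratio_drift`, and `eta` are assumptions made for illustration. The sketch compares the expected update direction under standard policy gradient with a gated version near a sub-optimal corner.

```python
import numpy as np

def softmax(theta):
    # Numerically stable softmax over logits.
    e = np.exp(theta - theta.max())
    return e / e.sum()

def expected_update(theta, rewards, eta=None):
    """Expected logit update for a softmax policy in a K-armed bandit.

    eta=None : standard policy gradient (no gate).
    eta=0.0  : hard gate, keeps only terms with positive delight.
    eta>0    : sigmoid gate on delight / eta (assumed gate form).
    """
    pi = softmax(theta)
    adv = rewards - pi @ rewards      # advantage vs. expected-reward baseline
    surprisal = -np.log(pi)           # action surprisal
    delight = adv * surprisal         # product of advantage and surprisal
    if eta is None:
        gate = np.ones_like(pi)
    elif eta == 0.0:
        gate = (delight > 0).astype(float)
    else:
        gate = 1.0 / (1.0 + np.exp(-delight / eta))
    # sum_a pi(a) * gate(a) * adv(a) * grad log pi(a),
    # where grad log pi(a) = e_a - pi for a softmax policy.
    w = pi * gate * adv
    return w - w.sum() * pi

def log_ratio_drift(theta, rewards, eta, best=0, corner=1):
    # Rate of change of log(pi[best] / pi[corner]) under the expected update.
    d = expected_update(theta, rewards, eta)
    return d[best] - d[corner]

# Policy concentrated on arm 1 (reward 0.5); arm 0 is optimal, arm 2 is worst.
rewards = np.array([1.0, 0.5, 0.0])
theta = np.array([0.0, 8.0, 4.0])

print("standard PG :", log_ratio_drift(theta, rewards, None))
print("hard gate   :", log_ratio_drift(theta, rewards, 0.0))
print("sigmoid gate:", log_ratio_drift(theta, rewards, 0.1))
```

In this configuration the worst arm is far more likely than the optimal arm. The ungated update therefore lowers the log-ratio of the optimal arm to the corner arm, while both gated variants raise it, because the negative-advantage term is suppressed.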
