Delightful Gradients Accelerate Corner Escape

Jincheng Mei; Ian Osband

arXiv:2605.11908·cs.LG·May 13, 2026

TL;DR

Delightful Policy Gradient (DG) escapes sub-optimal corners of the simplex faster than standard softmax policy gradient, with proven guarantees in bandits and MDPs and promising empirical results on MNIST contextual bandits.

Contribution

The paper studies DG, a policy gradient method that gates each gradient term by the product of advantage and action surprisal to mitigate corner trapping. It provides theoretical convergence guarantees and demonstrates practical benefits.

Findings

01. In $K$-armed bandits, the zero-temperature limit of DG removes corner trapping on a sector near any sub-optimal corner.

02. DG converges globally to the optimal policy at an O(1/t) rate.

03. In MNIST contextual bandits, DG recovers faster than standard policy gradient.

Abstract

Softmax policy gradient converges at $O(1/t)$, but its transient behavior near sub-optimal corners of the simplex can be exponentially slow. The bottleneck is self-trapping: negative-advantage actions reinforce the corner policy and can initially push the optimal action backward. We study Delightful Policy Gradient (DG), which gates each policy-gradient term by the product of advantage and action surprisal. For $K$-armed bandits, we prove that the zero-temperature limit of DG removes this corner-trapping mechanism on a quantitative sector near any sub-optimal corner, yielding a first-exit escape bound logarithmic in the initial probability ratio. At every fixed temperature, the same local mechanism persists because harmful actions are polynomially suppressed as they become rare. A key structural insight is that every action better than the corner action is an ally: its…
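The self-trapping effect and the gating idea described in the abstract can be illustrated with a small sketch of one expected gradient step on a 3-armed softmax bandit. The abstract only says that DG gates each term by the product of advantage and action surprisal; the specific gate used here, a sigmoid of that product divided by a temperature `eta`, is an assumption, as are the reward values, the starting probabilities, and all function names.

```python
import math

def softmax(theta):
    """Numerically stable softmax over a list of logits."""
    m = max(theta)
    exps = [math.exp(x - m) for x in theta]
    z = sum(exps)
    return [e / z for e in exps]

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def gate(advantage, prob, eta):
    """ASSUMED gate: sigmoid(advantage * surprisal / eta).

    surprisal = -log(prob). With eta == 0 the gate becomes a hard
    indicator that keeps only terms with positive advantage * surprisal.
    """
    delight = advantage * -math.log(prob)
    if eta == 0.0:
        return 1.0 if delight > 0 else 0.0
    return sigmoid(delight / eta)

def expected_update(theta, rewards, eta=None):
    """Expected gradient w.r.t. the logits of a softmax bandit policy.

    eta=None gives standard softmax policy gradient (no gate);
    otherwise each action's term is multiplied by gate(...).
    """
    pi = softmax(theta)
    baseline = sum(p * r for p, r in zip(pi, rewards))
    grad = [0.0] * len(theta)
    for a, (p_a, r_a) in enumerate(zip(pi, rewards)):
        adv = r_a - baseline
        w = 1.0 if eta is None else gate(adv, p_a, eta)
        # d/dtheta_i log pi(a) = 1[i == a] - pi[i]
        for i in range(len(theta)):
            grad[i] += p_a * w * adv * ((1.0 if i == a else 0.0) - pi[i])
    return grad

def optimal_prob_after_step(theta, rewards, eta=None, lr=1.0):
    """Probability of action 0 after one expected gradient step."""
    g = expected_update(theta, rewards, eta)
    return softmax([t + lr * gi for t, gi in zip(theta, g)])[0]

if __name__ == "__main__":
    # Hypothetical setup: action 0 is optimal, the policy sits near
    # the sub-optimal corner at action 1, and action 2 is a bad arm.
    rewards = [1.0, 0.9, 0.0]
    theta = [math.log(p) for p in (1e-3, 0.989, 1e-2)]
    print("before      ", softmax(theta)[0])
    print("standard PG ", optimal_prob_after_step(theta, rewards))
    print("gated eta=0 ", optimal_prob_after_step(theta, rewards, eta=0.0))
    print("gated eta=.1", optimal_prob_after_step(theta, rewards, eta=0.1))
```

In this setup the bad arm's negative advantage makes the corner action look better than the baseline, so the ungated step lowers the optimal action's probability, while the gated step (under the assumed gate) raises it. This is a toy illustration of the mechanism only, not a reproduction of the paper's analysis or experiments.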
