Delightful Policy Gradient

Ian Osband

arXiv:2603.14608·cs.LG·March 17, 2026

TL;DR

The paper introduces the Delightful Policy Gradient (DG), which gates each policy gradient term with a sigmoid of delight, the product of advantage and action surprisal. DG provably improves directional accuracy in $K$-armed bandits and empirically outperforms REINFORCE, PPO, and advantage-weighted baselines.

Contribution

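DG is a policy gradient method that gates each sampled action's term with a sigmoid of delight (advantage times action surprisal). The gate improves the directional accuracy of the update, and DG empirically outperforms REINFORCE, PPO, and advantage-weighted baselines.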

Findings

01

For $K$-armed bandits, DG provably improves directional accuracy within a single context.

02

Across multiple contexts, DG shifts the expected gradient strictly closer to the supervised cross-entropy oracle.

03

Empirically, DG outperforms REINFORCE, PPO, and advantage-weighted baselines on various tasks.

Abstract

Standard policy gradients weight each sampled action by advantage alone, regardless of how likely that action was under the current policy. This creates two pathologies: within a single decision context (e.g. one image or prompt), a rare negative-advantage action can disproportionately distort the update direction; across many such contexts in a batch, the expected gradient over-allocates budget to contexts the policy already handles well. We introduce the Delightful Policy Gradient (DG), which gates each term with a sigmoid of delight, the product of advantage and action surprisal (negative log-probability). For $K$-armed bandits, DG provably improves directional accuracy in a single context and, across multiple contexts, shifts the expected gradient strictly closer to the supervised cross-entropy oracle. This second effect is not variance reduction: it persists even…
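The gating rule described in the abstract can be sketched for a single softmax bandit context as follows. This is a minimal illustration, not the paper's implementation: the function names (`delight_gate`, `policy_gradient_term`) are hypothetical, the sigmoid is applied without any temperature or scaling, and the gate is treated as a constant weight on the usual advantage-weighted score-function term. The abstract does not confirm any of these details.

```python
import math


def softmax(logits):
    """Softmax over a list of logits, shifted by the max for stability."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def delight_gate(advantage, prob):
    """Sigmoid of delight, where delight = advantage * surprisal.

    Surprisal is the negative log-probability of the sampled action.
    Assumption: no temperature or offset inside the sigmoid.
    """
    surprisal = -math.log(prob)
    return sigmoid(advantage * surprisal)


def policy_gradient_term(logits, action, advantage, gated):
    """Per-sample gradient term with respect to the logits.

    For a softmax policy, grad log pi(action) = onehot(action) - probs.
    With gated=False this is the plain advantage-weighted term; with
    gated=True the term is additionally multiplied by the delight gate
    (assumed here to act as a constant weight).
    """
    probs = softmax(logits)
    weight = advantage
    if gated:
        weight *= delight_gate(advantage, probs[action])
    return [
        weight * ((1.0 if k == action else 0.0) - p)
        for k, p in enumerate(probs)
    ]


if __name__ == "__main__":
    logits = [2.0, 0.0, -2.0]  # arm 2 is rare under this policy
    probs = softmax(logits)
    for action, advantage in [(2, -1.0), (2, 1.0), (0, -1.0), (0, 1.0)]:
        gate = delight_gate(advantage, probs[action])
        print(f"action={action} advantage={advantage:+.1f} gate={gate:.3f}")
```

Under these assumptions, a rare action with negative advantage has strongly negative delight, so its gate is close to 0 and its term is suppressed. A rare action with positive advantage has a gate close to 1, and a likely action has a gate near 0.5 regardless of the sign of its advantage.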

Taxonomy

Topics: Advanced Bandit Algorithms Research · Explainable Artificial Intelligence (XAI) · Reinforcement Learning in Robotics