
TL;DR
The paper introduces the Delightful Policy Gradient (DG), a policy gradient method that gates each advantage-weighted term with a sigmoid of "delight", the product of advantage and action surprisal (negative log-probability), instead of weighting sampled actions by advantage alone.
Contribution
DG is a policy gradient method whose delight gate improves the directional accuracy of policy updates; in the reported experiments it outperforms REINFORCE, PPO, and advantage-weighted baselines.
Findings
In bandit settings, DG provably improves directional accuracy within a single context.
Across multiple contexts, DG shifts the expected gradient strictly closer to the supervised cross-entropy oracle.
Empirically, DG outperforms REINFORCE, PPO, and advantage-weighted baselines on various tasks.
Abstract
Standard policy gradients weight each sampled action by advantage alone, regardless of how likely that action was under the current policy. This creates two pathologies: within a single decision context (e.g. one image or prompt), a rare negative-advantage action can disproportionately distort the update direction; across many such contexts in a batch, the expected gradient over-allocates budget to contexts the policy already handles well. We introduce the Delightful Policy Gradient (DG), which gates each term with a sigmoid of delight, the product of advantage and action surprisal (negative log-probability). For K-armed bandits, DG provably improves directional accuracy in a single context and, across multiple contexts, shifts the expected gradient strictly closer to the supervised cross-entropy oracle. This second effect is not variance reduction: it persists even…
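The gating described in the abstract can be sketched for a single softmax bandit context. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`dg_gate`, `gradient_term`) are hypothetical, the policy is assumed to be a softmax over logits, and the gate is taken to be a plain sigmoid of advantage times surprisal with no temperature or normalization, since the excerpt does not say whether the paper uses either.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def dg_gate(advantage, log_prob):
    """Gate in (0, 1): sigmoid of delight = advantage * surprisal.

    Assumption: plain sigmoid, no temperature (not confirmed by the excerpt).
    """
    surprisal = -log_prob            # negative log-probability of the action
    delight = advantage * surprisal  # product named "delight" in the abstract
    return sigmoid(delight)


def gradient_term(logits, action, advantage, gated=True):
    """Per-sample gradient with respect to the logits of a softmax policy.

    gated=False gives the standard term: advantage * grad log pi(action).
    gated=True multiplies that term by the delight gate.
    """
    probs = softmax(logits)
    score = -probs                   # grad of log softmax = one_hot - probs
    score[action] += 1.0
    weight = advantage
    if gated:
        weight = advantage * dg_gate(advantage, np.log(probs[action]))
    return weight * score


if __name__ == "__main__":
    logits = np.array([2.0, 0.0, -2.0])   # arm 2 is rare under this policy
    for adv in (-1.0, 1.0):
        plain = gradient_term(logits, 2, adv, gated=False)
        dg = gradient_term(logits, 2, adv, gated=True)
        print(adv, np.linalg.norm(plain), np.linalg.norm(dg))
```

Under these assumptions, a rare action with negative advantage has large negative delight, so its gate is close to 0 and its term is strongly damped, while a rare action with positive advantage has a gate close to 1 and passes through nearly unchanged. That matches the first pathology the abstract describes, where a rare negative-advantage action distorts the update direction.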
Taxonomy
Topics: Advanced Bandit Algorithms Research · Explainable Artificial Intelligence (XAI) · Reinforcement Learning in Robotics
