Delightful Distributed Policy Gradient

Ian Osband

arXiv:2603.20521·cs.LG·May 14, 2026

TL;DR

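The paper introduces the Delightful Policy Gradient (DG), a method for distributed reinforcement learning that gates each update with delight, the product of advantage and surprisal. The gate suppresses high-surprisal failures while preserving high-surprisal successes, without requiring behavior probabilities, and the reported result is better sample efficiency and robustness to stale or mismatched actor data.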

Contribution

DG separates failures from successes using advantage and surprisal, suppressing rare failures without discarding rare successes, and reports gains on MNIST with staleness and on complex sequence tasks.

Findings

01

DG outperforms importance-weighted policy gradient (PG) on MNIST with staleness.

02

DG achieves nearly an order of magnitude lower error on complex sequence tasks.

03

Sample efficiency of DG grows with task complexity.

Abstract

Distributed reinforcement learning trains on data from stale, buggy, or mismatched actors, producing actions with high surprisal (negative log-probability) under the learner's policy. The core difficulty is not surprising data per se, but negative learning from surprising data. High-surprisal failures can dominate finite-batch updates through large perpendicular components, while high-surprisal successes reveal opportunities the current policy would otherwise miss. The Delightful Policy Gradient (DG) separates these cases by gating each update with delight, the product of advantage and surprisal, suppressing rare failures and preserving rare successes without behavior probabilities. In a tabular analysis, DG suppresses the perpendicular second moment of high-surprisal failures by a policy-overlap factor that vanishes as the learner improves. The advantage sign is…
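
To make the gating idea concrete, here is a minimal sketch in plain Python. The abstract only states that delight is the product of advantage and surprisal and that it gates each update. The sigmoid gate, the `temperature` parameter, and every function name below are assumptions made for illustration and may differ from the paper's actual form. Only the learner's own action probability is used, consistent with the claim that no behavior probabilities are needed.

```python
import math


def surprisal(learner_prob):
    """Negative log-probability of the action under the learner's policy."""
    return -math.log(learner_prob)


def delight(advantage, learner_prob):
    """Delight as described in the abstract: advantage times surprisal."""
    return advantage * surprisal(learner_prob)


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def gate(advantage, learner_prob, temperature=1.0):
    """ASSUMED gate: a sigmoid of delight, scaled by a hypothetical temperature.

    Large positive delight (rare success) gives a gate near 1.
    Large negative delight (rare failure) gives a gate near 0.
    """
    return sigmoid(delight(advantage, learner_prob) / temperature)


def gated_update_weight(advantage, learner_prob, temperature=1.0):
    """Scalar that would multiply the score function grad log pi(a|s)."""
    return gate(advantage, learner_prob, temperature) * advantage


if __name__ == "__main__":
    # Compare rare and common actions with positive and negative advantage.
    for label, adv, prob in [
        ("rare success", 1.0, 0.01),
        ("rare failure", -1.0, 0.01),
        ("common success", 1.0, 0.9),
        ("common failure", -1.0, 0.9),
    ]:
        print(label, round(gate(adv, prob), 4), round(gated_update_weight(adv, prob), 4))
```

Under these assumptions the gate is asymmetric in exactly the way the abstract describes: a rare action with positive advantage passes almost unchanged, a rare action with negative advantage is almost entirely suppressed, and common actions sit near a gate of one half either way.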
