
TL;DR
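This paper introduces the Kondo gate, which builds on the Delightful Policy Gradient (DG): it compares a sample's delight (advantage times surprisal) against a compute price and runs the backward pass only when the sample is worth it. On MNIST and transformer token reversal, the gate skips most backward passes while retaining nearly all of DG's learning quality.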
Contribution
The paper proposes the Kondo gate, which uses delight, a signal available from the forward pass alone, to decide which samples receive a backward pass. Varying the compute price traces a quality-cost Pareto frontier for policy gradient training.
Findings
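On MNIST and transformer token reversal, the Kondo gate skips most backward passes while retaining nearly all of DG's learning quality.
In bandits, delight is a more reliable screening signal than additive combinations of value and surprise, and zero-price gating preserves useful gradient signal while removing perpendicular noise.
The gains grow as problems get harder.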
Abstract
Policy gradient computes a backward pass for every sample, even though the backward pass is expensive and most samples carry little learning value. The Delightful Policy Gradient (DG) provides a forward-pass signal of learning value: delight, the product of advantage and surprisal (negative log-probability). We introduce the Kondo gate, which compares delight against a compute price and pays for a backward pass only when the sample is worth it, thereby tracing a quality-cost Pareto frontier. In bandits, zero-price gating preserves useful gradient signal while removing perpendicular noise, and delight is a more reliable screening signal than additive combinations of value and surprise. On MNIST and transformer token reversal, the Kondo gate skips most backward passes while retaining nearly all of DG's learning quality, with gains that grow as problems get harder and…
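The gating rule described in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the function names `delight` and `kondo_gate`, the hard threshold `delight > price`, and the sample values are all assumptions made here. The abstract states only that delight is advantage times surprisal and that it is compared against a compute price.

```python
import math

def delight(advantage, prob):
    """Delight = advantage * surprisal, with surprisal = -log(prob)."""
    return advantage * -math.log(prob)

def kondo_gate(advantages, probs, price=0.0):
    """Return a mask: True where a sample's delight exceeds the price.

    Assumption: a hard threshold. The abstract does not say whether
    the gate is deterministic or stochastic.
    """
    return [delight(a, p) > price for a, p in zip(advantages, probs)]

# Hypothetical batch: per-sample advantages and the policy's
# probability of the action that was actually taken.
advantages = [1.0, -0.5, 0.2, 2.0]
probs = [0.1, 0.5, 0.9, 0.05]

mask = kondo_gate(advantages, probs, price=0.5)
skipped = mask.count(False) / len(mask)

print(mask)     # which samples would get a backward pass
print(skipped)  # fraction of backward passes skipped
```

Only samples where the mask is true would be passed to the backward computation. Raising `price` skips more samples, which is how a sweep over prices would trace the quality-cost trade-off the abstract mentions.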
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Advanced Bandit Algorithms Research · Adversarial Robustness in Machine Learning
