Natural Policy Gradient as Doubly Smoothed Policy Iteration: A Bellman-Operator Framework
Phalguni Nanda, Zaiwei Chen

TL;DR
This paper introduces doubly smoothed policy iteration (DSPI), a Bellman-operator framework that unifies policy iteration, natural policy gradient, and policy dual averaging, and proves distribution-free global geometric convergence with an iteration complexity of O((1−γ)^{-1} log((1−γ)^{-1} ε^{-1})) for ε-optimal policies.
Contribution
It presents DSPI, a novel Bellman-operator framework that generalizes natural policy gradient and dual averaging, with proven convergence and complexity guarantees.
Findings
DSPI includes policy iteration, natural policy gradient, and dual averaging as special cases.
Proves distribution-free global geometric convergence of DSPI.
Achieves iteration complexity of O((1−γ)^{-1} log((1−γ)^{-1} ε^{-1})) for ε-optimal policies.
Abstract
In this work, we show that natural policy gradient, a core algorithm in reinforcement learning, admits an exact formulation as a smoothed and averaged form of policy iteration. Specifically, we introduce doubly smoothed policy iteration (DSPI), a Bellman-operator framework in which each policy is obtained by applying a regularized greedy step to a weighted average of past Q-functions. DSPI includes policy iteration, dual-averaged policy iteration, natural policy gradient, and more general policy dual averaging methods as special cases. Using only monotonicity and contraction of smoothed Bellman operators, we prove distribution-free global geometric convergence of DSPI. Consequently, standard natural policy gradient and policy dual averaging achieve an iteration complexity of O((1−γ)^{-1} log((1−γ)^{-1} ε^{-1})) for computing an ε-optimal policy,…
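The "regularized greedy step on an average of past Q-functions" view can be illustrated with a minimal sketch. It is not the paper's DSPI in full generality. It covers only the well-known tabular softmax special case, where the natural policy gradient update π_{k+1}(a|s) ∝ π_k(a|s) exp(η Q^{π_k}(s,a)), started from the uniform policy, equals a softmax applied to η times the running sum of past Q-functions. The function names, the uniform weighting, the step size `eta`, and the exact policy evaluation by linear solve are all assumptions made for illustration.

```python
import numpy as np

def policy_q(P, R, pi, gamma):
    """Exact Q^pi for a tabular MDP with P: (S, A, S), R: (S, A), pi: (S, A)."""
    S, _ = R.shape
    P_pi = np.einsum("sa,sat->st", pi, P)   # state-to-state kernel under pi
    r_pi = (pi * R).sum(axis=1)             # expected one-step reward under pi
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    return R + gamma * P @ V                # shape (S, A)

def softmax(x):
    """Row-wise softmax, i.e. an entropy-regularized greedy step."""
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def npg_as_averaged_pi(P, R, gamma=0.9, eta=1.0, iters=20):
    """Assumed sketch: softmax of the (uniformly weighted) sum of past Q-functions."""
    S, A = R.shape
    pi = np.full((S, A), 1.0 / A)
    q_sum = np.zeros((S, A))
    for _ in range(iters):
        q_sum += policy_q(P, R, pi, gamma)  # accumulate past Q-functions
        pi = softmax(eta * q_sum)           # regularized greedy step
    return pi

def npg_multiplicative(P, R, gamma=0.9, eta=1.0, iters=20):
    """Standard tabular softmax NPG: multiplicative-weights update per state."""
    S, A = R.shape
    pi = np.full((S, A), 1.0 / A)
    for _ in range(iters):
        Q = policy_q(P, R, pi, gamma)
        w = pi * np.exp(eta * (Q - Q.max(axis=1, keepdims=True)))
        pi = w / w.sum(axis=1, keepdims=True)
    return pi

def random_mdp(S=5, A=3, seed=0):
    """Hypothetical random test MDP with rewards in [0, 1)."""
    rng = np.random.default_rng(seed)
    P = rng.random((S, A, S))
    P /= P.sum(axis=2, keepdims=True)
    return P, rng.random((S, A))
```

Running both routines on the same MDP, for example `P, R = random_mdp()`, should give policies that agree up to floating-point error. After k iterations, the softmax of η times the sum is an entropy-regularized greedy step on the plain average of past Q-functions with temperature 1/(ηk), which is one way to read the "smoothed and averaged" description above.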
