Dynamic Policy Programming
Mohammad Gheshlaghi Azar, Vicenc Gomez, Hilbert J. Kappen

TL;DR
This paper introduces dynamic policy programming (DPP), a new policy iteration method for infinite-horizon Markov decision processes, with finite-iteration and asymptotic performance-loss bounds and empirical results that outperform existing reinforcement learning methods.
Contribution
The paper proposes DPP, a novel policy iteration approach with new theoretical performance bounds and demonstrated empirical superiority over traditional RL algorithms.
Findings
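DPP's performance-loss bounds depend on the ℓ∞-norm of the average accumulated error, rather than on the ℓ∞-norm of the per-iteration error as in AVI and API.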
Numerical experiments show DPP outperforms existing RL methods.
DPP effectively averages out simulation noise during learning.
Abstract
In this paper, we propose a novel policy iteration method, called dynamic policy programming (DPP), to estimate the optimal policy in infinite-horizon Markov decision processes. We prove finite-iteration and asymptotic ℓ∞-norm performance-loss bounds for DPP in the presence of approximation/estimation error. The bounds are expressed in terms of the ℓ∞-norm of the average accumulated error, as opposed to the ℓ∞-norm of the error in the case of standard approximate value iteration (AVI) and approximate policy iteration (API). This suggests that DPP can achieve better performance than AVI and API, since it averages out the simulation noise caused by Monte-Carlo sampling throughout the learning process. We examine these theoretical results numerically by comparing the performance of the approximate variants of DPP with existing reinforcement learning (RL)…
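The distinction between the two error quantities in the abstract can be illustrated with a small simulation. The sketch below is not DPP itself and is not taken from the paper: it assumes, purely for illustration, that the per-iteration error is independent zero-mean Gaussian noise on a fixed number of entries, and the names `sup_norm` and `simulate` are hypothetical. It compares the largest ℓ∞-norm of any single iteration's error with the ℓ∞-norm of the error averaged over all iterations.

```python
import random


def sup_norm(values):
    """Return the l-infinity norm (largest absolute entry) of a vector."""
    return max(abs(x) for x in values)


def simulate(num_iterations=1000, num_entries=20, noise_std=1.0, seed=0):
    """Compare per-iteration error with averaged accumulated error.

    Assumption (illustrative only): each iteration's error is i.i.d.
    zero-mean Gaussian noise on every entry.
    """
    rng = random.Random(seed)
    running_sum = [0.0] * num_entries
    worst_single = 0.0
    for _ in range(num_iterations):
        eps = [rng.gauss(0.0, noise_std) for _ in range(num_entries)]
        # Quantity of the kind that appears in per-iteration error bounds.
        worst_single = max(worst_single, sup_norm(eps))
        # Accumulate errors so they can be averaged afterwards.
        running_sum = [s + e for s, e in zip(running_sum, eps)]
    # Quantity of the kind that appears in average-accumulated-error bounds.
    averaged = sup_norm([s / num_iterations for s in running_sum])
    return worst_single, averaged


worst_single, averaged = simulate()
print("max per-iteration sup-norm error:", worst_single)
print("sup-norm of averaged error:      ", averaged)
```

Under this zero-mean noise assumption the averaged error shrinks as the number of iterations grows, while the worst single-iteration error does not. If the errors were biased rather than zero-mean, averaging would not remove the bias.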
Taxonomy
Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Optimization and Search Problems
