On the Performance Bounds of some Policy Search Dynamic Programming Algorithms
Bruno Scherrer (INRIA Nancy - Grand Est / LORIA)

TL;DR
This paper analyzes performance bounds of policy search algorithms in Markov Decision Processes, introducing a new algorithm that balances performance guarantees with computational efficiency.
Contribution
It provides new performance bounds for DPI and CPI, and introduces NSDPI, combining their advantages in terms of guarantees and complexity.
Findings
CPI has better performance guarantees than DPI but higher time complexity.
NSDPI achieves similar guarantees to CPI with lower computational cost.
Each algorithm's performance bound depends on concentrability constants, and comparing these constants is what separates the guarantees of CPI and DPI.
Abstract
We consider the infinite-horizon discounted optimal control problem formalized by Markov Decision Processes. We focus on Policy Search algorithms, which compute an approximately optimal policy by following the standard Policy Iteration (PI) scheme via an ε-approximate greedy operator (Kakade and Langford, 2002; Lazaric et al., 2010). We describe existing and a few new performance bounds for Direct Policy Iteration (DPI) (Lagoudakis and Parr, 2003; Fern et al., 2006; Lazaric et al., 2010) and Conservative Policy Iteration (CPI) (Kakade and Langford, 2002). By paying particular attention to the concentrability constants involved in such guarantees, we notably argue that the guarantee of CPI is much better than that of DPI, but this comes at the cost of a relative increase of time complexity that is exponential in 1/ε. We then describe an algorithm, Non-Stationary Direct Policy…
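For readers unfamiliar with the Policy Iteration scheme the abstract builds on, the following is a minimal sketch of it, not the paper's algorithm. It uses an exact greedy operator, which corresponds to the special case where the greedy step makes no error, whereas DPI, CPI and NSDPI rely on an approximate one. The two-state MDP, the constant `GAMMA`, and all function names are hypothetical and chosen only for illustration.

```python
# Hypothetical two-state, two-action MDP (not taken from the paper).
GAMMA = 0.9  # discount factor, assumed for this example

# P[s][a] = list of (next_state, probability); R[s][a] = immediate reward.
P = {
    0: {0: [(0, 1.0)], 1: [(1, 1.0)]},
    1: {0: [(0, 1.0)], 1: [(1, 1.0)]},
}
R = {
    0: {0: 0.0, 1: 0.0},
    1: {0: 0.0, 1: 1.0},
}


def q_value(v, s, a):
    """One-step lookahead value of taking action a in state s, then following v."""
    return R[s][a] + GAMMA * sum(p * v[s2] for s2, p in P[s][a])


def evaluate(policy, sweeps=2000):
    """Approximate the value of a deterministic policy by repeated Bellman backups."""
    v = {s: 0.0 for s in P}
    for _ in range(sweeps):
        v = {s: q_value(v, s, policy[s]) for s in P}
    return v


def greedy(v):
    """Exact greedy operator: pick the best one-step lookahead action in each state."""
    return {s: max(P[s], key=lambda a: q_value(v, s, a)) for s in P}


def policy_iteration(policy, max_iters=100):
    """Alternate evaluation and greedy improvement until the policy stops changing."""
    for _ in range(max_iters):
        improved = greedy(evaluate(policy))
        if improved == policy:
            break
        policy = improved
    return policy, evaluate(policy)


policy, value = policy_iteration({0: 0, 1: 0})
```

On this toy problem the loop settles on taking action 1 in both states. The algorithms discussed in the abstract differ in how the greedy step is approximated and how its output is combined with earlier policies, which this sketch does not attempt to reproduce.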
Taxonomy
Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics
