On the Performance Bounds of some Policy Search Dynamic Programming Algorithms

Bruno Scherrer (INRIA Nancy - Grand Est / LORIA)

arXiv:1306.0539 · cs.AI · June 4, 2013

TL;DR

This paper analyzes performance bounds of policy search algorithms in Markov Decision Processes, introducing a new algorithm that balances performance guarantees with computational efficiency.

Contribution

The paper provides new performance bounds for DPI and CPI and introduces NSDPI, which combines their advantages in terms of guarantees and complexity.

Findings

01

CPI has a much better performance guarantee than DPI, at the cost of a relative increase in time complexity that is exponential in 1/ε.

02

NSDPI achieves guarantees similar to those of CPI at a lower computational cost.

03

The performance bounds of these algorithms depend on the concentrability constants they involve.

Abstract

We consider the infinite-horizon discounted optimal control problem formalized by Markov Decision Processes. We focus on Policy Search algorithms, which compute an approximately optimal policy by following the standard Policy Iteration (PI) scheme via an $\epsilon$-approximate greedy operator (Kakade and Langford, 2002; Lazaric et al., 2010). We describe existing and a few new performance bounds for Direct Policy Iteration (DPI) (Lagoudakis and Parr, 2003; Fern et al., 2006; Lazaric et al., 2010) and Conservative Policy Iteration (CPI) (Kakade and Langford, 2002). By paying particular attention to the concentrability constants involved in such guarantees, we notably argue that the guarantee of CPI is much better than that of DPI, but that this comes at the cost of a relative increase in time complexity that is exponential in $\frac{1}{\epsilon}$. We then describe an algorithm, Non-Stationary Direct Policy…
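
To make the contrast between the DPI and CPI update rules concrete, here is a minimal sketch on a tiny tabular problem. Everything in it is an illustrative assumption rather than the paper's setup: the two-state MDP, the constants, and the function names (`evaluate`, `greedy`, `dpi_step`, `cpi_step`) are hypothetical, the greedy step is computed exactly instead of through an $\epsilon$-approximate operator, and the CPI mixing rate `alpha` is fixed by hand, whereas Kakade and Langford (2002) choose it from an estimated policy advantage.

```python
# Hypothetical 2-state, 2-action MDP; every number here is an assumption.
GAMMA = 0.9
N_STATES, N_ACTIONS = 2, 2
# P[s][a][t]: probability of moving from state s to state t under action a.
# Action 0 always leads to state 0, action 1 always leads to state 1.
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[1.0, 0.0], [0.0, 1.0]]]
# R[s][a]: immediate reward; only action 1 in state 1 pays off.
R = [[0.0, 0.0],
     [0.0, 1.0]]


def q_values(v, s):
    """One-step lookahead values of each action in state s, given values v."""
    return [R[s][a] + GAMMA * sum(P[s][a][t] * v[t] for t in range(N_STATES))
            for a in range(N_ACTIONS)]


def evaluate(policy, sweeps=500):
    """Iterative evaluation of a stochastic policy, policy[s][a] = probability."""
    v = [0.0] * N_STATES
    for _ in range(sweeps):
        v = [sum(policy[s][a] * q for a, q in enumerate(q_values(v, s)))
             for s in range(N_STATES)]
    return v


def greedy(v):
    """Exact greedy policy w.r.t. v (stand-in for an approximate greedy operator)."""
    policy = []
    for s in range(N_STATES):
        q = q_values(v, s)
        best = q.index(max(q))  # ties go to the lowest action index
        policy.append([1.0 if a == best else 0.0 for a in range(N_ACTIONS)])
    return policy


def dpi_step(policy):
    """DPI-style update: replace the policy by the greedy one."""
    return greedy(evaluate(policy))


def cpi_step(policy, alpha=0.1):
    """CPI-style update: move a fraction alpha toward the greedy policy."""
    g = greedy(evaluate(policy))
    return [[(1 - alpha) * policy[s][a] + alpha * g[s][a]
             for a in range(N_ACTIONS)] for s in range(N_STATES)]


start = [[1.0, 0.0], [1.0, 0.0]]  # always take action 0
print("DPI after two steps:", dpi_step(dpi_step(start)))
print("CPI after one step: ", cpi_step(start))
```

The DPI-style step jumps straight to a deterministic policy, while the CPI-style step yields a stochastic mixture that changes slowly. The sketch only illustrates that difference in update rule; it says nothing about the concentrability constants or the complexity trade-off discussed in the abstract.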

Taxonomy

Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics