On the Complexity of Policy Iteration
Yishay Mansour, Satinder Singh

TL;DR
This paper establishes the first non-trivial, worst-case upper bounds, independent of the discount factor, on the number of iterations policy iteration requires to find the optimal policy in Markov decision processes.
Contribution
It provides the first non-trivial, worst-case upper bounds on the number of iterations policy iteration (PI) needs to converge that do not depend on the discount factor, and it clarifies how PI moves through the space of policies.
Findings
Derived the first non-trivial, worst-case upper bounds on the number of iterations PI needs to converge to the optimal policy, independent of the discount factor
Shed new light on how PI progresses through the space of policies
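For context, a standard argument (background, not the paper's result) gives the trivial bound against which a "non-trivial" bound is presumably measured. Assume n states, k actions in every state, and let \pi_t be the policy at iteration t; these symbols are introduced here for illustration.

```latex
% Policy improvement: the greedy step never lowers the value of any state,
% and it raises at least one state's value unless \pi_t is already optimal.
V^{\pi_{t+1}}(s) \ge V^{\pi_t}(s) \quad \forall s,
\qquad \text{with strict inequality for some } s \text{ unless } \pi_t \text{ is optimal.}

% Hence no policy is visited twice, so the number of iterations T satisfies
T \le \left|\{\pi : S \to A\}\right| = k^{n}.
```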
Abstract
Decision-making problems in uncertain or stochastic domains are often formulated as Markov decision processes (MDPs). Policy iteration (PI) is a popular algorithm for searching over policy-space, the size of which is exponential in the number of states. We are interested in bounds on the complexity of PI that do not depend on the value of the discount factor. In this paper we prove the first such non-trivial, worst-case, upper bounds on the number of iterations required by PI to converge to the optimal policy. Our analysis also sheds new light on the manner in which PI progresses through the space of policies.
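To make the algorithm concrete, here is a minimal Python sketch of textbook (Howard-style) policy iteration on a tiny hypothetical MDP: exact policy evaluation by solving a linear system, followed by a greedy improvement step that switches every state that has a strictly better action. It is not the paper's code, and the specific PI variants and bounds analyzed by Mansour and Singh are not reproduced here. The MDP, the function names (`policy_iteration`, `evaluate`, `q_value`, `brute_force_values`, `solve_linear`), the tie-breaking rule, and the tolerance are illustrative assumptions. The brute-force enumeration is included only to check the answer and to show how fast policy space grows.

```python
from itertools import product


def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b]; new rows, so the caller's data is untouched.
    m = [list(map(float, row)) + [float(b[i])] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def evaluate(policy, P, R, gamma):
    """Exact policy evaluation: solve (I - gamma * P_pi) V = R_pi."""
    n = len(policy)
    a = [[(1.0 if s == t else 0.0) - gamma * P[s][policy[s]][t] for t in range(n)]
         for s in range(n)]
    b = [R[s][policy[s]] for s in range(n)]
    return solve_linear(a, b)


def q_value(s, action, V, P, R, gamma):
    """One-step lookahead value of taking `action` in `s`, then following V."""
    return R[s][action] + gamma * sum(p * v for p, v in zip(P[s][action], V))


def policy_iteration(P, R, gamma, tol=1e-12):
    """Switch every improvable state at once; return (policy, V, rounds)."""
    n = len(P)
    policy = [0] * n  # arbitrary initial policy: action 0 everywhere
    rounds = 0
    while True:
        V = evaluate(policy, P, R, gamma)
        rounds += 1  # counts evaluate/improve rounds, including the final check
        new_policy = list(policy)
        for s in range(n):
            best = max(range(len(P[s])), key=lambda a: q_value(s, a, V, P, R, gamma))
            # Switch only on strict improvement so ties cannot cause cycling.
            if q_value(s, best, V, P, R, gamma) > V[s] + tol:
                new_policy[s] = best
        if new_policy == policy:  # no improvable state: policy is optimal
            return policy, V, rounds
        policy = new_policy


def brute_force_values(P, R, gamma):
    """Evaluate all k**n deterministic policies (exponential in the state count)."""
    n, k = len(P), len(P[0])  # assumes every state has the same k actions
    return [evaluate(list(pi), P, R, gamma) for pi in product(range(k), repeat=n)]


# Hypothetical 3-state chain: action 0 = stay, action 1 = move right.
# P[s][a][t] is the probability of moving from s to t under action a;
# a reward of 1 is earned only in state 2.
P = [
    [[1, 0, 0], [0, 1, 0]],
    [[0, 1, 0], [0, 0, 1]],
    [[0, 0, 1], [0, 0, 1]],
]
R = [[0, 0], [0, 0], [1, 1]]

policy, V, rounds = policy_iteration(P, R, gamma=0.9)
print(policy, rounds)  # → [1, 1, 0] 3
```

Here PI stops after 3 rounds, while `brute_force_values` must evaluate all 2^3 = 8 policies, a count that grows as k^n with the number of states.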
Taxonomy
Topics: Reinforcement Learning in Robotics · Machine Learning and Algorithms · Formal Methods in Verification
