Improved and Generalized Upper Bounds on the Complexity of Policy Iteration
Bruno Scherrer (BIGS)

TL;DR
This paper gives improved upper bounds on the number of iterations that policy iteration algorithms need in Markov Decision Processes. It covers bounds that depend on the discount factor as well as discount-independent bounds that rely on structural properties of the MDP, under which Simplex-PI is strongly polynomial.
Contribution
It introduces tighter bounds for Howard's and Simplex-PI, including discount-independent bounds based on structural properties, and extends results to broader classes of MDPs.
Findings
Howard's PI terminates after at most O(m/(1-γ) log(1/(1-γ))) iterations.
Simplex-PI terminates after at most O(nm/(1-γ) log(1/(1-γ))) iterations.
Under structural assumptions on the MDP, Simplex-PI is strongly polynomial, and discount-independent bounds are provided for both algorithms.
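To get a feel for how the two discount-dependent bounds above scale, the expressions inside the O(·) can be evaluated directly. This is only a rough sketch: the function names are hypothetical, the hidden constants of the O-notation are ignored, and the natural logarithm is an assumption (the base only changes the constant).

```python
import math


def howard_bound_scale(m, gamma):
    """Evaluate m/(1-gamma) * log(1/(1-gamma)), the term inside the O(.)
    of the Howard's PI bound. Constants are omitted, so this is a scale,
    not an iteration count."""
    return m / (1.0 - gamma) * math.log(1.0 / (1.0 - gamma))


def simplex_bound_scale(n, m, gamma):
    """Evaluate nm/(1-gamma) * log(1/(1-gamma)), the term inside the O(.)
    of the Simplex-PI bound (n times the Howard's PI expression)."""
    return n * howard_bound_scale(m, gamma)


if __name__ == "__main__":
    # Illustrative sizes only: n = 3 states, m = 6 actions in total.
    for gamma in (0.5, 0.9, 0.99):
        print(gamma, howard_bound_scale(6, gamma), simplex_bound_scale(3, 6, gamma))
```

Both expressions grow without bound as γ approaches 1, which is why the discount-independent bounds under structural assumptions are of separate interest.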
Abstract
Given a Markov Decision Process (MDP) with n states and a total number m of actions, we study the number of iterations needed by Policy Iteration (PI) algorithms to converge to the optimal γ-discounted policy. We consider two variations of PI: Howard's PI that changes the actions in all states with a positive advantage, and Simplex-PI that only changes the action in the state with maximal advantage. We show that Howard's PI terminates after at most O(m/(1-γ) log(1/(1-γ))) iterations, improving by a factor O(log n) a result by Hansen et al., while Simplex-PI terminates after at most O(nm/(1-γ) log(1/(1-γ))) iterations, improving by a factor O(log n) a result by Ye. Under some structural properties of the MDP, we then consider bounds that are independent of the discount factor γ: …
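The difference between the two update rules described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and not taken from the paper: the names `evaluate`, `advantages`, `policy_iteration` and `TOY_MDP`, the list-based MDP encoding, the tolerances, and the use of iterative policy evaluation in place of an exact linear solve.

```python
def evaluate(mdp, policy, gamma, tol=1e-12):
    """Approximate the value of `policy` by iterating
    v <- r_pi + gamma * P_pi v until successive iterates differ by < tol."""
    v = [0.0] * len(mdp)
    while True:
        new_v = []
        for s in range(len(mdp)):
            reward, probs = mdp[s][policy[s]]
            new_v.append(reward + gamma * sum(p * v[t] for t, p in enumerate(probs)))
        if max(abs(a - b) for a, b in zip(new_v, v)) < tol:
            return new_v
        v = new_v


def advantages(mdp, v, gamma):
    """adv[s][a] = r(s, a) + gamma * sum_t P(t | s, a) v(t) - v(s)."""
    return [
        [r + gamma * sum(p * v[t] for t, p in enumerate(probs)) - v[s]
         for (r, probs) in mdp[s]]
        for s in range(len(mdp))
    ]


def policy_iteration(mdp, gamma, variant="howard", eps=1e-9):
    """Run PI from the policy that picks action 0 everywhere.

    variant="howard":  switch every state that has a positive advantage.
    variant="simplex": switch only the state with the maximal advantage.
    Returns (policy, values, number_of_iterations).
    """
    policy = [0] * len(mdp)
    iterations = 0
    while True:
        v = evaluate(mdp, policy, gamma)
        adv = advantages(mdp, v, gamma)
        # Best action per state, and the states where it strictly improves.
        best = [max(range(len(row)), key=row.__getitem__) for row in adv]
        improvable = [s for s in range(len(mdp)) if adv[s][best[s]] > eps]
        if not improvable:
            return policy, v, iterations
        if variant == "howard":
            for s in improvable:
                policy[s] = best[s]
        else:
            s = max(improvable, key=lambda state: adv[state][best[state]])
            policy[s] = best[s]
        iterations += 1


# Hypothetical toy chain: mdp[s] is a list of (reward, transition_probabilities).
# Action 0 stays in place with reward 0; action 1 pays a reward and moves right.
TOY_MDP = [
    [(0.0, [1.0, 0.0, 0.0]), (1.0, [0.0, 1.0, 0.0])],
    [(0.0, [0.0, 1.0, 0.0]), (2.0, [0.0, 0.0, 1.0])],
    [(0.0, [0.0, 0.0, 1.0]), (3.0, [0.0, 0.0, 1.0])],
]

if __name__ == "__main__":
    for variant in ("howard", "simplex"):
        print(variant, policy_iteration(TOY_MDP, 0.9, variant))
```

On this toy chain both variants reach the same policy, but Howard's PI switches all improvable states in one pass while Simplex-PI switches a single state per iteration. This is a property of the example only and says nothing about the general bounds.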
