Upper Bounds for All and Max-gain Policy Iteration Algorithms on Deterministic MDPs
Ritesh Goenka, Eashan Gupta, Sushil Khyalia, Pratyush Agarwal, Mulinti Shaik Wajid, Shivaram Kalyanakrishnan

TL;DR
This paper derives upper bounds on the running time of Policy Iteration (PI) algorithms on deterministic MDPs (DMDPs), covering both the entire PI family and its max-gain switching variants, using graph-theoretic analysis.
Contribution
It provides the first non-trivial upper bounds for all PI algorithms on DMDPs and confirms a conjecture for Howard's PI in this setting.
Findings
Upper bounds for all PI algorithms on DMDPs
Upper bounds for max-gain switching variants
Confirmation of a conjecture for Howard's PI on DMDPs
Abstract
Policy Iteration (PI) is a widely used family of algorithms to compute optimal policies for Markov Decision Problems (MDPs). We derive upper bounds on the running time of PI on Deterministic MDPs (DMDPs): the class of MDPs in which every state-action pair has a unique next state. Our results include a non-trivial upper bound that applies to the entire family of PI algorithms; another to all "max-gain" switching variants; and affirmation that a conjecture regarding Howard's PI on MDPs is true for DMDPs. Our analysis is based on certain graph-theoretic results, which may be of independent interest.
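To make the setting concrete, here is a minimal sketch of policy iteration on a small discounted DMDP. It is an illustration only, not the paper's construction or analysis: the function names `evaluate` and `howard_pi`, the three-state example, and the discount factor of 1/2 are all assumptions. The improvement step switches every improvable state to its highest-gain action, which is one common way to combine Howard-style switching with max-gain action selection; the paper's precise definitions may differ.

```python
from fractions import Fraction


def evaluate(policy, next_state, reward, gamma):
    """Exact discounted values of a deterministic policy on a DMDP.

    Following the policy from any state traces a path that ends in a cycle,
    so each value is a finite sum plus a geometric series over that cycle.
    """
    n = len(policy)
    values = [None] * n
    for start in range(n):
        if values[start] is not None:
            continue
        path, position = [], {}
        s = start
        # Walk until we hit an already-evaluated state or revisit one on this path.
        while values[s] is None and s not in position:
            position[s] = len(path)
            path.append(s)
            s = next_state[s][policy[s]]
        if values[s] is None:
            # s closes a new cycle; sum one discounted lap starting at s.
            total, discount = Fraction(0), Fraction(1)
            for c in path[position[s]:]:
                total += discount * reward[c][policy[c]]
                discount *= gamma
            values[s] = total / (1 - discount)
        # Back up values along the path, last state first.
        for u in reversed(path):
            if values[u] is None:
                v = next_state[u][policy[u]]
                values[u] = reward[u][policy[u]] + gamma * values[v]
    return values


def howard_pi(next_state, reward, gamma, policy=None):
    """Switch every improvable state (to its max-gain action) until none remain."""
    n = len(next_state)
    policy = list(policy) if policy is not None else [0] * n
    iterations = 0
    while True:
        values = evaluate(policy, next_state, reward, gamma)
        new_policy = list(policy)
        for s in range(n):
            best_q = values[s]
            for a in range(len(next_state[s])):
                q = reward[s][a] + gamma * values[next_state[s][a]]
                if q > best_q:  # strict improvement only, so PI terminates
                    new_policy[s], best_q = a, q
        if new_policy == policy:
            return policy, values, iterations
        policy = new_policy
        iterations += 1


# Hypothetical 3-state, 2-action DMDP: next_state[s][a] is the unique successor.
next_state = [[0, 1], [1, 2], [2, 0]]
reward = [[0, 1], [0, 1], [2, 0]]
policy, values, iterations = howard_pi(next_state, reward, Fraction(1, 2))
print(policy, [float(v) for v in values], iterations)  # [1, 1, 0] [2.5, 3.0, 4.0] 1
```

Exact rational arithmetic is used so that the strict-improvement test is not affected by floating-point error. The iteration count returned here is the quantity that running-time bounds of the kind described in the abstract are concerned with.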
Taxonomy
Topics: Formal Methods in Verification
