Convergence of Fast Policy Iteration in Markov Games and Robust MDPs

Keith Badger; Jefferson Huang; Marek Petrik

arXiv:2508.06661 · cs.GT · November 18, 2025



TL;DR

This paper shows that the Filar-Tolwinski (FT) algorithm for Markov games and robust MDPs can fail to converge to a saddle point, and introduces Residual Conditioned Policy Iteration (RCPI), a method built on FT that is guaranteed to converge.

Contribution

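The paper shows that FT may fail to converge to a saddle point, contradicting the convergence proof in the original paper, and proposes RCPI, which builds on FT, is guaranteed to converge, and outperforms other convergent algorithms by several orders of magnitude in the authors' numerical results.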

Findings

01

FT may fail to converge to a saddle point and may loop indefinitely, even in small games.

02

RCPI guarantees convergence to a saddle point.

03

RCPI outperforms other convergent algorithms by several orders of magnitude in numerical experiments.

Abstract

Markov games and robust MDPs are closely related models that involve computing a pair of saddle point policies. As part of the long-standing effort to develop efficient algorithms for these models, the Filar-Tolwinski (FT) algorithm has shown considerable promise. As our first contribution, we demonstrate that FT may fail to converge to a saddle point and may loop indefinitely, even in small games. This observation contradicts the proof of FT's convergence to a saddle point in the original paper. As our second contribution, we propose Residual Conditioned Policy Iteration (RCPI). RCPI builds on FT, but is guaranteed to converge to a saddle point. Our numerical results show that RCPI outperforms other convergent algorithms by several orders of magnitude.
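To make the saddle-point setting concrete, the sketch below runs plain robust value iteration on a tiny robust MDP: the agent maximizes over actions while an adversary picks the worst transition distribution from a finite set. This is a standard baseline, not FT or RCPI, whose details are not given in this summary. The two-state model, the names `REWARDS`, `MODELS`, `robust_backup`, and `robust_value_iteration`, and all numbers are hypothetical and chosen only for illustration.

```python
# Hypothetical toy robust MDP: 2 states, 2 actions, discount factor 0.9.
# For each (state, action) pair the adversary chooses one distribution
# from a finite candidate set (an (s,a)-rectangular uncertainty set).
GAMMA = 0.9
REWARDS = [[1.0, 0.0], [0.0, 2.0]]  # REWARDS[s][a]
# MODELS[s][a] is a list of candidate next-state distributions.
MODELS = [
    [[[0.9, 0.1], [0.5, 0.5]], [[0.2, 0.8]]],
    [[[1.0, 0.0]], [[0.3, 0.7], [0.6, 0.4]]],
]


def robust_backup(v, s, a):
    """Reward plus discounted worst-case expected next-state value."""
    worst = min(
        sum(p * v_next for p, v_next in zip(dist, v))
        for dist in MODELS[s][a]
    )
    return REWARDS[s][a] + GAMMA * worst


def robust_value_iteration(tol=1e-10, max_iter=10_000):
    """Iterate the max-min Bellman operator until the residual is below tol."""
    n_states = len(REWARDS)
    v = [0.0] * n_states
    for _ in range(max_iter):
        new_v = [
            max(robust_backup(v, s, a) for a in range(len(REWARDS[s])))
            for s in range(n_states)
        ]
        residual = max(abs(x - y) for x, y in zip(new_v, v))
        v = new_v
        if residual < tol:
            break
    # Greedy agent policy with respect to the final value estimate.
    policy = [
        max(range(len(REWARDS[s])), key=lambda a: robust_backup(v, s, a))
        for s in range(n_states)
    ]
    return v, policy


if __name__ == "__main__":
    values, policy = robust_value_iteration()
    print("robust values:", values)
    print("greedy policy:", policy)
```

Value iteration of this kind converges because the max-min operator is a contraction for a discount factor below 1, but it can be slow. Policy-iteration-style methods such as FT aim to be faster, which is where the convergence question studied in the paper arises.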



Taxonomy

Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Stochastic Gradient Optimization Techniques