Policy Gradient for Robust Markov Decision Processes
Qiuhao Wang, Shaohang Xu, Chin Pang Ho, Marek Petrik

TL;DR
This paper introduces DRPMD, a policy gradient method for robust MDPs that accounts for model ambiguity and comes with a global optimality guarantee.
Contribution
The paper presents DRPMD, a new policy gradient algorithm with convergence guarantees for robust MDPs, together with a convergence analysis, new parametric transition kernels, and empirical validation.
Findings
DRPMD guarantees convergence to a globally optimal policy.
Empirical results demonstrate robustness across various settings.
New parametric transition kernels extend applicability to continuous spaces.
Abstract
We develop a generic policy gradient method with a global optimality guarantee for robust Markov Decision Processes (MDPs). While policy gradient methods are widely used for solving dynamic decision problems because they are scalable and efficient, adapting them to account for model ambiguity has been challenging, often making it impractical to learn robust policies. This paper introduces a novel policy gradient method, Double-Loop Robust Policy Mirror Descent (DRPMD), for solving robust MDPs. DRPMD employs a general mirror descent update rule for policy optimization with an adaptive tolerance per iteration, guaranteeing convergence to a globally optimal policy. We provide a comprehensive analysis of DRPMD, including new convergence results under both direct and softmax parameterizations, and provide novel insights into the inner problem solution through Transition Mirror…
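The double-loop idea in the abstract (an inner worst-case problem over transition kernels, an outer mirror descent step on the policy) can be illustrated with a small tabular sketch. This is not the authors' algorithm: it assumes a cost-minimizing agent, a hypothetical finite list of candidate kernels searched by brute force in place of the paper's inner solver, a KL-based mirror descent step, and a uniform initial state distribution. The names `policy_evaluation`, `worst_case_kernel`, `mirror_descent_step`, and `robust_pmd` are illustrative only, and the sketch makes no claim about convergence.

```python
import numpy as np

def policy_evaluation(P, c, pi, gamma):
    """Exact value and action-value of pi under kernel P (S x A x S) and cost c (S x A)."""
    S = c.shape[0]
    P_pi = np.einsum("sa,sat->st", pi, P)      # state-to-state kernel induced by pi
    c_pi = np.sum(pi * c, axis=1)              # expected one-step cost under pi
    v = np.linalg.solve(np.eye(S) - gamma * P_pi, c_pi)
    q = c + gamma * (P @ v)                    # (S, A, S) @ (S,) -> (S, A)
    return v, q

def worst_case_kernel(kernels, c, pi, gamma, rho):
    """Inner loop (assumption: finite candidate set): kernel with the highest expected cost."""
    values = [rho @ policy_evaluation(P, c, pi, gamma)[0] for P in kernels]
    return kernels[int(np.argmax(values))]

def mirror_descent_step(log_pi, q, eta):
    """KL mirror descent on the simplex: pi_new is proportional to pi * exp(-eta * q)."""
    logits = log_pi - eta * q
    logits -= logits.max(axis=1, keepdims=True)            # numerical stability
    return logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

def robust_pmd(kernels, c, gamma=0.9, eta=1.0, iters=200):
    """Outer loop: alternate worst-case kernel selection and a policy mirror descent step."""
    S, A = c.shape
    rho = np.full(S, 1.0 / S)                  # assumed uniform initial distribution
    log_pi = np.full((S, A), -np.log(A))       # start from the uniform policy
    for _ in range(iters):
        pi = np.exp(log_pi)
        P = worst_case_kernel(kernels, c, pi, gamma, rho)
        _, q = policy_evaluation(P, c, pi, gamma)
        log_pi = mirror_descent_step(log_pi, q, eta)
    return np.exp(log_pi)
```

Calling `robust_pmd(kernels, c)` with a list of `S x A x S` arrays and an `S x A` cost matrix returns a row-stochastic policy matrix. Working in log space keeps the multiplicative update from underflowing over many iterations.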
Taxonomy
Topics: Simulation Techniques and Applications
Methods: Softmax
