Near-Optimal Policy Identification in Robust Constrained Markov Decision Processes via Epigraph Form

Toshinori Kitamura; Tadashi Kozuno; Wataru Kumagai; Kenta Hoshino; Yohei Hosoe; Kazumi Kasaura; Masashi Hamaya; Paavo Parmas; Yutaka Matsuo

arXiv:2408.16286·cs.LG·April 27, 2026

TL;DR

This paper introduces a novel algorithm for identifying near-optimal policies in robust constrained MDPs, overcoming limitations of traditional policy gradient methods by using an epigraph form and bisection search.

Contribution

The paper presents the first algorithm guaranteed to identify a near-optimal policy in robust constrained MDPs (RCMDPs), built on an epigraph reformulation and a bisection search.

Findings

01

The proposed algorithm identifies an $\epsilon$-optimal policy using $\tilde{O}(\frac{1}{\epsilon^4})$ policy evaluations.

02

Conventional policy gradient methods can get trapped in suboptimal solutions due to conflicting gradients.

03

The epigraph form effectively resolves gradient conflicts in the RCMDP optimization.
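
For intuition, the epigraph reformulation of a generic constrained problem can be written as below. The symbols $f$ (objective), $h$ (constraint), $b$ (threshold), and $\Delta$ are illustrative assumptions rather than the paper's notation. The worst case over environments is omitted, and the minima are assumed to be attained.

$$
\min_{\pi} f(\pi) \ \ \text{s.t.}\ \ h(\pi) \le 0
\quad\Longleftrightarrow\quad
\min_{b \in \mathbb{R}} b \ \ \text{s.t.}\ \ \Delta(b) := \min_{\pi} \max\{\, f(\pi) - b,\ h(\pi) \,\} \le 0
$$

Since $\Delta(b)$ is nonincreasing in $b$, the smallest feasible threshold can be located by bisection. At any $\pi$ where the two terms differ, the gradient of the inner $\max$ comes from only one of them, the objective term or the constraint term, instead of a weighted sum of both.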

Abstract

Designing a safe policy for uncertain environments is crucial in real-world control systems. However, this challenge remains inadequately addressed within the Markov decision process (MDP) framework. This paper presents the first algorithm guaranteed to identify a near-optimal policy in a robust constrained MDP (RCMDP), where an optimal policy minimizes cumulative cost while satisfying constraints in the worst-case scenario across a set of environments. We first prove that the conventional policy gradient approach to the Lagrangian max-min formulation can become trapped in suboptimal solutions. This occurs when its inner minimization encounters a sum of conflicting gradients from the objective and constraint functions. To address this, we leverage the epigraph form of the RCMDP problem, which resolves the conflict by selecting a single gradient from either the objective or the…
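
The bisection idea mentioned in the abstract can be illustrated on a toy problem. The following is a minimal sketch under stated assumptions, not the paper's algorithm: it minimizes a scalar objective `x*x` subject to `1 - x <= 0`, replaces the policy-gradient inner step with a brute-force search over a grid, and ignores the robust (worst-case over environments) part. All names (`epigraph_gap`, `bisection_epigraph`, `grid`) are hypothetical.

```python
from typing import Callable, Optional, Sequence, Tuple


def epigraph_gap(
    b: float,
    candidates: Sequence[float],
    objective: Callable[[float], float],
    constraint: Callable[[float], float],
) -> Tuple[float, float]:
    """Return (min over x of max(objective(x) - b, constraint(x)), minimizing x).

    The inner minimization is brute force over `candidates`. This is a
    stand-in assumption, not the policy-gradient step used in the paper.
    """
    def violation(x: float) -> float:
        # Only the larger of the two terms matters at a given x.
        return max(objective(x) - b, constraint(x))

    best_x = min(candidates, key=violation)
    return violation(best_x), best_x


def bisection_epigraph(
    candidates: Sequence[float],
    objective: Callable[[float], float],
    constraint: Callable[[float], float],
    lo: float,
    hi: float,
    tol: float = 1e-6,
) -> Tuple[float, Optional[float]]:
    """Bisection on the threshold b.

    Assumes the gap is positive at `lo` and non-positive at `hi`.
    Returns the smallest accepted threshold and the candidate attaining it.
    """
    best_x: Optional[float] = None
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        gap, x = epigraph_gap(mid, candidates, objective, constraint)
        if gap <= 0.0:
            # Some candidate is feasible with objective <= mid: shrink from above.
            hi, best_x = mid, x
        else:
            # No candidate meets both conditions: the optimum lies above mid.
            lo = mid
    return hi, best_x


# Toy instance: minimize x^2 subject to x >= 1 on a grid over [0, 2].
grid = [i / 100 for i in range(201)]
b_star, x_star = bisection_epigraph(
    grid, lambda x: x * x, lambda x: 1.0 - x, lo=0.0, hi=4.0
)
print(b_star, x_star)
```

In this toy instance the constrained optimum is at `x = 1` with objective value `1`, so the returned threshold should land within the tolerance of `1`. Each bisection step costs one inner minimization, which is where a real method would spend its policy evaluations.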
