Non-Rectangular Average-Reward Robust MDPs: Optimal Policies and Their Transient Values
Shengbo Wang, Nian Si

TL;DR
This paper investigates non-rectangular robust Markov decision processes under the average-reward criterion, providing theoretical insights into optimal policies, their transient performance, and a practical epoch-based policy with strong guarantees.
Contribution
It introduces a transient-value framework for non-rectangular average-reward robust MDPs, establishes the existence of robust-optimal policies without rectangularity, and designs an epoch-based policy with guaranteed transient performance.
Findings
Robust optimal policies exist without rectangularity assumptions.
Average-reward optimality may conceal poor transient performance.
Proposed epoch-based policy achieves constant-order transient value.
Abstract
We study non-rectangular robust Markov decision processes under the average-reward criterion, where the ambiguity set couples transition probabilities across states and the adversary commits to a stationary kernel for the entire horizon. We show that any history-dependent policy achieving sublinear expected regret uniformly over the ambiguity set is robust-optimal, and that the robust value admits a minimax representation as the infimum over the ambiguity set of the classical optimal gains, without requiring any form of rectangularity or robust dynamic programming principle. Under the weak communication assumption, we establish the existence of such policies by converting high-probability regret bounds from the average-reward reinforcement learning literature into the expected-regret criterion. We then introduce a transient-value framework to evaluate finite-time performance of robust…
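The minimax representation mentioned in the abstract (the robust value as the infimum, over the ambiguity set, of the classical optimal gains) can be illustrated with a small numerical sketch. Everything below is an assumption made for illustration and is not taken from the paper: the two-state toy MDP, the finite ambiguity set indexed by a single parameter `theta` shared by both states (so the set is coupled across states rather than rectangular), and the use of standard relative value iteration, which presumes each kernel is unichain and aperiodic.

```python
def optimal_gain(P, r, iters=10_000, tol=1e-12):
    """Approximate the optimal average reward (gain) of a fixed MDP.

    P[s][a] is a list of next-state probabilities, r[s][a] is the reward.
    Uses relative value iteration; assumes a unichain, aperiodic model.
    """
    n = len(P)
    h = [0.0] * n          # relative value (bias) estimate
    lo = hi = 0.0
    for _ in range(iters):
        # One Bellman backup per state, maximizing over actions.
        Th = [
            max(
                r[s][a] + sum(p * h[t] for t, p in enumerate(P[s][a]))
                for a in range(len(P[s]))
            )
            for s in range(n)
        ]
        diff = [Th[s] - h[s] for s in range(n)]
        lo, hi = min(diff), max(diff)   # lower/upper bounds on the gain
        h = [x - Th[0] for x in Th]     # renormalize at reference state 0
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)


def kernel(theta):
    """Hypothetical toy kernel: the same theta drives BOTH states.

    Action 0 ("safe") moves to state 1 with probability 0.5.
    Action 1 ("risky") moves to state 1 with probability theta.
    """
    return [[[0.5, 0.5], [1.0 - theta, theta]] for _ in range(2)]


# Reward 1 is collected in state 1, reward 0 in state 0, for either action.
REWARD = [[0.0, 0.0], [1.0, 1.0]]

# Hypothetical finite ambiguity set; the adversary picks one stationary theta.
AMBIGUITY_SET = [0.2, 0.5, 0.8]


def robust_value(thetas):
    """Infimum over the ambiguity set of the classical optimal gains."""
    return min(optimal_gain(kernel(t), REWARD) for t in thetas)


if __name__ == "__main__":
    for t in AMBIGUITY_SET:
        print(t, optimal_gain(kernel(t), REWARD))
    print("robust value:", robust_value(AMBIGUITY_SET))
```

In this toy model the optimal gain for a given `theta` is `max(0.5, theta)`, so the minimum over the set is attained wherever the safe action is at least as good as the risky one. The sketch only evaluates the right-hand side of the representation; it says nothing about how a single policy attains that value against every kernel, which is the part the paper addresses through regret bounds.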
Taxonomy
Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Game Theory and Applications
