TL;DR
This paper introduces an online learning algorithm for distributionally robust multi-agent reinforcement learning, enabling agents to learn robust policies directly from environment interactions without prior data, with theoretical guarantees.
Contribution
It pioneers online learning in distributionally robust Markov games and proposes the MORNAVI algorithm with provable guarantees for robustness and efficiency.
Findings
Achieves low regret in robust policy learning.
Effectively handles uncertainties measured by TV and KL divergences.
Provides the first theoretical guarantees for online DRMG algorithms.
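To make the TV and KL uncertainty sets concrete, here is a minimal sketch of the worst-case expected value around a single nominal next-state distribution. It is not the paper's MORNAVI implementation. The function names, the toy vectors `p` and `v`, the radius `sigma`, and the grid of dual variables are all assumptions made for illustration. The TV case uses the standard greedy mass-shifting solution, and the KL case uses the standard scalar dual, evaluated on a grid so that it returns a lower bound on the true worst case.

```python
import numpy as np


def tv_worst_case(p, v, sigma):
    """Minimise q @ v over distributions q with 0.5 * ||q - p||_1 <= sigma."""
    p = np.asarray(p, dtype=float)
    v = np.asarray(v, dtype=float)
    q = p.copy()
    budget = float(sigma)
    worst = int(np.argmin(v))  # lowest-value state receives the moved mass
    # Drain probability from the highest-value states first. The receiving
    # state may have zero nominal mass: a TV ball can leave the nominal support.
    for s in np.argsort(-v):
        if budget <= 0.0:
            break
        if s == worst:
            continue
        moved = min(q[s], budget)
        q[s] -= moved
        q[worst] += moved
        budget -= moved
    return float(q @ v)


def kl_worst_case(p, v, sigma, lambdas=None):
    """Lower-bound inf q @ v over KL(q || p) <= sigma using the dual
    sup_{lam > 0} -lam * log E_p[exp(-v / lam)] - lam * sigma on a grid."""
    p = np.asarray(p, dtype=float)
    v = np.asarray(v, dtype=float)
    if lambdas is None:
        lambdas = np.logspace(-3, 3, 2000)  # hypothetical search grid
    on = p > 0  # a KL ball cannot put mass outside the nominal support
    v_min = v[on].min()
    shifted = v[on] - v_min  # shift by the minimum for numerical stability
    best = -np.inf
    for lam in lambdas:
        log_mgf = np.log(np.sum(p[on] * np.exp(-shifted / lam)))
        best = max(best, v_min - lam * log_mgf - lam * sigma)
    return float(best)


# Toy example (assumed numbers): three next states with values v.
p = np.array([0.5, 0.3, 0.2])
v = np.array([1.0, 0.0, 2.0])
print("nominal:", float(p @ v))
print("TV, sigma=0.1:", tv_worst_case(p, v, 0.1))  # about 0.7 (nominal is 0.9)
print("KL, sigma=0.1:", kl_worst_case(p, v, 0.1))
```

Both functions return a value no larger than the nominal expectation, and the gap grows with `sigma`.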
Abstract
Well-trained multi-agent systems can fail when deployed in real-world environments due to model mismatches between the training and deployment environments, caused by environment uncertainties including noise or adversarial attacks. Distributionally Robust Markov Games (DRMGs) enhance system resilience by optimizing for worst-case performance over a defined set of environmental uncertainties. However, current methods are limited by their dependence on simulators or large offline datasets, which are often unavailable. This paper pioneers the study of online learning in DRMGs, where agents learn directly from environmental interactions without prior data. We introduce the Multiplayer Optimistic Robust Nash Value Iteration (MORNAVI) algorithm and provide the first provable guarantees for this setting. Our theoretical analysis demonstrates that the algorithm achieves low regret and…
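The worst-case objective described in the abstract can be written, under assumptions, as follows. The notation is chosen here and is not taken from the paper. It assumes a finite horizon H, a nominal transition kernel P^0, a joint policy π, a per-agent reward r_{i,h}, a divergence D (TV or KL), a single radius σ, and an uncertainty set that factorizes over steps, states, and joint actions (the rectangularity assumption).

```latex
V_{i}^{\pi,\sigma}(s) \;=\; \inf_{P \in \mathcal{U}^{\sigma}(P^{0})}
  \mathbb{E}_{\pi,P}\!\left[\sum_{h=1}^{H} r_{i,h}(s_h, \mathbf{a}_h) \,\middle|\, s_1 = s\right],
\qquad
\mathcal{U}^{\sigma}(P^{0}) \;=\; \bigotimes_{h,\,s,\,\mathbf{a}}
  \left\{ P \in \Delta(\mathcal{S}) \;:\; D\!\left(P,\; P^{0}_{h}(\cdot \mid s,\mathbf{a})\right) \le \sigma \right\}.
```

Setting σ = 0 collapses the uncertainty set to the nominal kernel and recovers the standard (non-robust) Markov game value.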
Peer Reviews
Decision: ICLR 2026 Poster
Originality: The paper addresses a relatively unexplored problem: online distributionally robust Markov games (DRMGs) without access to simulators or offline data. While the formulation itself extends concepts familiar from single-agent robust RL and generative/offline DRMG studies, applying them to the online multi-agent regime is a natural but nontrivial step. The proposed MORNAVI framework, integrating optimism for exploration with robustness against uncertainty, is conceptually consistent with
1. Limited Conceptual Novelty Beyond Extension: While the paper presents a rigorous treatment of online Distributionally Robust Markov Games (DRMGs), its conceptual novelty is limited. The proposed MORNAVI algorithm largely repackages existing principles, namely optimism in exploration and robust Bellman operators, previously developed in single-agent robust RL (e.g., Wang & Zou, NeurIPS 2021; Dong et al., ICML 2022; Panaganti & Kalathil, ICML 2022). Extending these to the multi-agent setting is a
+ The theoretical development is clear, with high-probability regret bounds for TV and KL uncertainty sets and sample-complexity corollaries for reaching equilibrium under the NE, CE, and CCE solution concepts.
+ The algorithmic design is well structured and well motivated. It separates model estimation, robust optimistic planning with divergence-aware bonuses, and equilibrium computation, and the mathematical treatment of support shift is interesting and well written.
-- The paper lacks empirical validation. Although the theoretical results look sound, there is no experimental evidence that the proposed online method outperforms prior approaches or that the constants and overheads are practical.
-- The practical comparison to generative or offline baselines and to out-of-distribution (OOD) scenarios is unclear. It would help to quantify how the robust online procedure fares against strong non-robust or offline/generative methods on OOD tasks.
-- While the
- The paper supplies both lower bounds (separations) and matching upper bounds (for TV and KL) together. The proof structure is standard but carefully adapted to the robust multi-agent setting.
- The paper identifies and formalizes the online DRMG problem (vs. prior offline/generative-model work) and isolates two distinct hardness phenomena (support shift and curse-of-multi-agency).
- All upper bounds and the lower bounds include the product of agent action counts. This is a severe scalability concern (exponential in number of agents if each has many actions). The paper acknowledges this as an open question but does not give practical guidance or alleviate it. This limits real-world applicability.
- The algorithm requires solving an equilibrium (Nash/CE/CCE) in the stagewise matrix game for each state and timestep. In practice large action space and many states make these s
