Robust and Diverse Multi-Agent Learning via Rational Policy Gradient
Niklas Lauffer, Ameesh Shah, Micah Carroll, Sanjit A. Seshia, Stuart Russell, Michael Dennis

TL;DR
This paper introduces Rational Policy Gradient, a new adversarial optimization method that maintains agent rationality in cooperative multi-agent settings, enabling robust, diverse, and adaptable policies beyond zero-sum scenarios.
Contribution
The paper proposes Rationality-preserving Policy Optimization (RPO), a formalism for adversarial optimization, and Rational Policy Gradient (RPG), an algorithm for solving it. Together they prevent self-sabotage in adversarial training for cooperative multi-agent settings.
Findings
RPG extends adversarial algorithms to cooperative settings.
RPG improves the robustness and diversity of the learned policies.
Empirical results show strong performance in various environments.
Abstract
Adversarial optimization algorithms that explicitly search for flaws in agents' policies have been successfully applied to finding robust and diverse policies in multi-agent settings. However, the success of adversarial optimization has been largely limited to zero-sum settings because its naive application in cooperative settings leads to a critical failure mode: agents are irrationally incentivized to self-sabotage, blocking the completion of tasks and halting further learning. To address this, we introduce Rationality-preserving Policy Optimization (RPO), a formalism for adversarial optimization that avoids self-sabotage by ensuring agents remain rational--that is, their policies are optimal with respect to some possible partner policy. To solve RPO, we develop Rational Policy Gradient (RPG), which trains agents to maximize their own reward in a modified version of the original game…
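The self-sabotage failure mode and the rationality condition can be illustrated on a toy common-payoff matrix game. The sketch below is not RPG itself (which is gradient-based and trains in a modified version of the game). It is a brute-force illustration under stated assumptions: the payoff matrix `R`, the fixed `victim_action`, and all function names are hypothetical, and rationality is checked only against deterministic partner policies, which is a simplification of "optimal with respect to some possible partner policy".

```python
import numpy as np

# Hypothetical common-payoff game: rows are the adversary's actions,
# columns are the partner's (victim's) actions; both receive R[a, b].
R = np.array([
    [1.0, 0.0],    # action 0: succeeds if the partner follows convention 0
    [0.0, 1.0],    # action 1: succeeds if the partner follows convention 1
    [-1.0, -1.0],  # action 2: sabotage, the task fails whatever the partner does
])

# Assumption: the victim being tested always plays convention 0.
victim_action = 0


def naive_adversary(R, victim_action):
    """Pick the action that minimizes the shared reward against the victim."""
    return int(np.argmin(R[:, victim_action]))


def rational_actions(R):
    """Actions that are a best response to at least one deterministic partner action."""
    best_per_partner = R.max(axis=0)  # best achievable payoff for each partner action
    is_best_response = np.isclose(R, best_per_partner).any(axis=1)
    return [int(a) for a in np.flatnonzero(is_best_response)]


def rational_adversary(R, victim_action):
    """Minimize the victim's reward, but only over rational actions."""
    candidates = rational_actions(R)
    return min(candidates, key=lambda a: R[a, victim_action])


print("naive adversary picks:", naive_adversary(R, victim_action))
print("rational actions:", rational_actions(R))
print("rational adversary picks:", rational_adversary(R, victim_action))
```

In this toy game the unconstrained adversary chooses the sabotage action, which reveals nothing about the victim. The rationality-constrained adversary instead chooses convention 1, exposing a genuine weakness: the victim cannot coordinate with a partner who follows a different but sensible convention.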
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Stochastic Gradient Optimization Techniques · Advanced Bandit Algorithms Research
