Semi-On-Policy Training for Sample Efficient Multi-Agent Policy Gradients
Bozhidar Vasilev, Tarun Gupta, Bei Peng, Shimon Whiteson

TL;DR
This paper introduces semi-on-policy (SOP) training to improve the sample efficiency of multi-agent policy gradient methods. Two state-of-the-art policy gradient algorithms enhanced with SOP training show significant performance improvements on the SMAC benchmark.
Contribution
The paper proposes semi-on-policy (SOP) training and applies it to two state-of-the-art multi-agent policy gradient algorithms, closing the performance gap with value-based methods on SMAC.
Findings
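SOP training yields significant performance improvements for two state-of-the-art policy gradient algorithms on SMAC tasks.
The enhanced policy gradient algorithms perform as well as or better than state-of-the-art value-based methods on a variety of SMAC tasks.
SOP training addresses the sample inefficiency of on-policy policy gradient methods in a computationally efficient way.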
Abstract
Policy gradient methods are an attractive approach to multi-agent reinforcement learning problems due to their convergence properties and robustness in partially observable scenarios. However, there is a significant performance gap between state-of-the-art policy gradient and value-based methods on the popular StarCraft Multi-Agent Challenge (SMAC) benchmark. In this paper, we introduce semi-on-policy (SOP) training as an effective and computationally efficient way to address the sample inefficiency of on-policy policy gradient methods. We enhance two state-of-the-art policy gradient algorithms with SOP training, demonstrating significant performance improvements. Furthermore, we show that our methods perform as well as or better than state-of-the-art value-based methods on a variety of SMAC tasks.
Taxonomy
Topics: Reinforcement Learning in Robotics · Adversarial Robustness in Machine Learning · Advanced Bandit Algorithms Research
