Descent-Guided Policy Gradient for Scalable Cooperative Multi-Agent Learning
Shan Yang, Yang Liu

TL;DR
This paper introduces DG-PG, a policy gradient method that uses differentiable analytical models to reduce gradient-estimator variance and improve scalability in cooperative multi-agent reinforcement learning.
Contribution
DG-PG integrates differentiable analytical models into policy-gradient updates, reducing policy-gradient estimator variance and enabling scalable learning in large multi-agent systems.
Findings
DG-PG reduces policy-gradient estimator variance from O(N) to O(1).
DG-PG converges within 20 episodes on a 1500-agent cloud scheduling task.
MAPPO and IPPO fail to converge under the same conditions.
Abstract
Scaling cooperative multi-agent reinforcement learning (MARL) is fundamentally limited by cross-agent noise. When agents share a common reward, each agent's learning signal is computed from a shared return that depends on all agents, so the stochasticity of the other agents enters the signal as cross-agent noise that grows with the number of agents N. Fortunately, many engineering systems, such as cloud computing and power systems, have differentiable analytical models that prescribe efficient system states, providing a new reference beyond noisy shared returns. In this work, we propose Descent-Guided Policy Gradient (DG-PG), a framework that augments policy-gradient updates with a noise-free descent signal derived from differentiable analytical models. We prove that DG-PG reduces policy-gradient estimator variance from O(N) to O(1), preserves the equilibria of the cooperative…
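The cross-agent noise described in the abstract can be illustrated with a small simulation. The sketch below is not the paper's method or experimental setup; it is a toy model chosen for this illustration, in which each agent draws a unit-variance Gaussian action with mean zero and all agents receive the sum of the actions as a shared return. The function name `shared_return_grad_variance` and all parameters are hypothetical.

```python
import numpy as np


def shared_return_grad_variance(n_agents, n_samples=20_000, seed=0):
    """Empirical variance of agent 0's score-function gradient estimate.

    Toy setup (an assumption for illustration, not taken from the paper):
    each agent j samples a_j ~ Normal(theta_j, 1) with theta_j = 0, and
    every agent is trained on the shared return R = sum_j a_j.
    """
    rng = np.random.default_rng(seed)
    # One row per sampled joint action, one column per agent.
    actions = rng.standard_normal((n_samples, n_agents))
    shared_return = actions.sum(axis=1)
    # Score of a unit-variance Gaussian policy at theta_0 = 0:
    # d/d(theta_0) log p(a_0) = a_0 - theta_0 = a_0.
    score = actions[:, 0]
    # Single-sample policy-gradient estimates for agent 0.
    grad_samples = score * shared_return
    return grad_samples.var()


if __name__ == "__main__":
    for n in (10, 50, 250):
        print(n, shared_return_grad_variance(n))
```

In this toy model the variance of agent 0's estimate works out analytically to N + 1, so the printed values should grow roughly linearly with the number of agents even though agent 0's own contribution to the return stays the same size. The extra variance comes entirely from the other agents' actions entering the shared return.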
