Descent-Guided Policy Gradient for Scalable Cooperative Multi-Agent Learning

Shan Yang; Yang Liu

arXiv:2602.20078·cs.MA·May 6, 2026

TL;DR

This paper introduces DG-PG, a policy-gradient method that uses differentiable analytical models to reduce gradient-estimator variance and improve scalability in cooperative multi-agent reinforcement learning.

Contribution

DG-PG augments policy-gradient updates with a descent signal derived from differentiable analytical models, reducing estimator variance and enabling learning in large multi-agent systems.

Findings

01

DG-PG reduces policy-gradient estimator variance from O(N) to O(1), where N is the number of agents.

02

DG-PG converges within 20 episodes on a 1500-agent cloud scheduling task.

03

MAPPO and IPPO fail to converge under the same conditions.

Abstract

Scaling cooperative multi-agent reinforcement learning (MARL) is fundamentally limited by cross-agent noise. When agents share a common reward, each agent's learning signal is computed from a shared return that depends on all agents, so the stochasticity of the other agents enters the signal as cross-agent noise that grows with $N$. Fortunately, many engineering systems, such as cloud computing and power systems, have differentiable analytical models that prescribe efficient system states, providing a new reference beyond noisy shared returns. In this work, we propose Descent-Guided Policy Gradient (DG-PG), a framework that augments policy-gradient updates with a noise-free descent signal derived from differentiable analytical models. We prove that DG-PG reduces policy-gradient estimator variance from $O(N)$ to $O(1)$, preserves the equilibria of the cooperative…
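
The cross-agent noise described in the abstract can be illustrated with a small toy experiment. The sketch below is not the paper's algorithm or benchmark: the quadratic shared return, the unit-variance Gaussian policies, and the names `shared_return`, `score_function_samples`, and `descent_signal` are all assumptions made for illustration. It estimates one agent's score-function gradient from a shared return and compares its variance, as the number of agents grows, with a deterministic gradient taken from a differentiable model. In this toy the differentiable model happens to coincide with the expected return, which need not hold in general.

```python
import random
import statistics


def shared_return(actions, targets):
    """Hypothetical shared reward: negative squared distance to target states."""
    return -sum((a - t) ** 2 for a, t in zip(actions, targets))


def score_function_samples(n_agents, n_samples, theta=0.5, target=0.0, seed=0):
    """Sample agent 0's score-function gradient estimate under a shared return.

    Every agent uses a Gaussian policy a_j ~ Normal(theta, 1), so
    d/dtheta log pi(a_0) = a_0 - theta = eps_0.
    """
    rng = random.Random(seed)
    # Exact expected return, used as a baseline (valid for this toy setup only).
    baseline = -n_agents * ((theta - target) ** 2 + 1.0)
    targets = [target] * n_agents
    samples = []
    for _ in range(n_samples):
        eps = [rng.gauss(0.0, 1.0) for _ in range(n_agents)]
        actions = [theta + e for e in eps]
        ret = shared_return(actions, targets)
        # The other agents' randomness enters through `ret`.
        samples.append((ret - baseline) * eps[0])
    return samples


def descent_signal(theta=0.5, target=0.0):
    """Noise-free gradient of the toy analytical model -(theta - target)^2."""
    return -2.0 * (theta - target)


if __name__ == "__main__":
    for n in (4, 16, 64):
        s = score_function_samples(n, 4000)
        print(n, round(statistics.mean(s), 2), round(statistics.variance(s), 1))
    print("descent signal:", descent_signal())
```

In this setup the sampled estimator and the analytical gradient agree in expectation, but the sampled estimator's variance increases with the number of agents while the analytical gradient has no sampling noise at all. This is only meant to make the motivation concrete; how DG-PG actually combines the two signals is described in the paper itself.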
