Cooperative Multi-agent RL with Communication Constraints
Nuoya Xiong, Aarti Singh

TL;DR
This paper introduces a novel base policy prediction method for cooperative multi-agent reinforcement learning under communication constraints, significantly reducing communication rounds and sample complexity while ensuring convergence to equilibrium.
Contribution
It proposes a new technique called base policy prediction that improves learning efficiency in decentralized MARL with limited communication, outperforming existing methods.
Findings
Converges to an ε-Nash equilibrium with fewer communication rounds.
Reduces sample complexity, with no exponential dependence on the size of the action space.
Effective in both simulated games and complex environments.
Abstract
Cooperative MARL often assumes frequent access to global information in a data buffer, such as team rewards or other agents' actions, which is typically unrealistic in decentralized MARL systems due to high communication costs. When communication is limited, agents must rely on outdated information to estimate gradients and update their policies. A common approach to handling missing data is importance sampling, in which old data collected under a base policy is reweighted to estimate gradients for the current policy. However, it quickly becomes unstable when communication is limited (i.e., the probability of missing data is high), because the base policy used for importance sampling becomes outdated. To address this issue, we propose a technique called base policy prediction, which utilizes old gradients to predict the policy update and collect samples for a sequence of base policies, which reduces the gap…
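The instability described in the abstract can be seen in a toy calculation. The sketch below is not the paper's algorithm: it assumes a hypothetical single-state problem with a softmax policy, a fixed update direction `grad`, and a slightly inaccurate stale gradient `old_grad` (all names and numbers are illustrative). It compares the exact second moment of the importance weights when the base policy is never refreshed against the case where a sequence of base policies is predicted from the old gradient.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

def is_second_moment(target, base):
    """Exact E_{a~base}[(target(a)/base(a))^2] = sum_a target(a)^2 / base(a).

    Equals 1 when target == base and grows as the two policies drift apart,
    so it is a proxy for the variance of an importance-sampling estimate.
    """
    return float(np.sum(target ** 2 / base))

# Hypothetical setup (assumptions, not taken from the paper).
theta0 = np.zeros(4)                                  # logits at the last communication round
grad = np.array([1.0, 0.5, -0.5, -1.0])               # true update direction, assumed fixed
old_grad = grad + np.array([0.1, -0.1, 0.1, -0.1])    # stale, slightly wrong gradient
eta = 0.3                                             # step size
steps = 6                                             # updates made without communicating

# Stale base: the policy that collected data at the last communication round.
base_stale = softmax(theta0)
# Predicted bases: extrapolate the policy path using only the old gradient.
predicted_bases = [softmax(theta0 + k * eta * old_grad) for k in range(steps + 1)]

stale_moments, predicted_moments = [], []
for t in range(steps + 1):
    current = softmax(theta0 + t * eta * grad)        # policy actually reached after t updates
    stale_moments.append(is_second_moment(current, base_stale))
    predicted_moments.append(is_second_moment(current, predicted_bases[t]))

for t, (s, p) in enumerate(zip(stale_moments, predicted_moments)):
    print(f"t={t}  stale base: {s:.3f}  predicted base: {p:.3f}")
```

In this toy setting the weight second moment under the stale base grows with every update, while the predicted base stays close to the current policy as long as the old gradient is a reasonable guide. How the paper chooses the predicted policies and when it triggers communication is not specified in this summary.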
Peer Reviews
Decision: Submitted to ICLR 2026
1. The core idea (predicting future base policies via natural-policy-gradient-style updates and collecting data for that whole predicted set in a single communication round) is a clean variance-control mechanism for importance sampling under staleness, and it is explicitly tied to the communication trigger condition.
2. The analysis for potential games gives simultaneous bounds on communication rounds $O\left(\varepsilon^{-3 / 4}\right)$ and samples $O\left(\operatorname{poly}\left(\max _i\left|A
1. The communication-trigger condition (a two-part test on reward difference and on elapsed steps) is chosen to bound IS variance, but the paper does not give a tight or adaptive rule showing this condition is near-optimal for a given environment.
2. The sample complexity still carries a relatively high exponent $\varepsilon^{-11 / 4}$; although better than some baselines, the proof does not clarify whether this exponent is an artifact of handling multiple predicted policies or is information-the
Originality: The Base Policy Prediction mechanism is a creative modification to classical importance sampling, introducing gradient-based prediction of base policies rather than relying on static ones. It bridges a clear gap between theoretical MARL under communication constraints and practical distributed implementations like MAPPO, a direction that has been rarely formalized.
Quality: The sample and communication complexity improvements are significant. The empirical section validates theo
While the experiments show promising results, the paper could include quantitative comparisons of communication vs. performance trade-offs across a wider range of intervals. The SMAC and MPE experiments are promising but lack variance/error bars and statistical significance tests to confirm robustness. The Base Policy Prediction approach requires computing and storing multiple predicted policies per round. All agents are assumed homogeneous in terms of policy structure and reward access. Rea
Novel Theoretical Advancement: Introduces Base Policy Prediction, a modification to importance sampling that bridges the gap between outdated and current policies.
Improved Efficiency: Achieves state-of-the-art results in both communication cost and sample complexity, removing the dependence on the joint action space size.
Strong Theoretical Guarantees: Provides formal convergence proofs to an ε-Nash equilibrium and clear bounds on communication and sample complexity.
Practical Validation: In
Dependence on Gradient Prediction Accuracy: The success of Base Policy Prediction heavily depends on the accuracy of old gradient estimates. Noisy or non-stationary environments could degrade performance.
ε-Nash Equilibrium vs. Global Optimum: The algorithm converges to an ε-Nash equilibrium or local optimum, which is not necessarily the globally optimal cooperative solution.
Taxonomy
Topics: Reinforcement Learning in Robotics · Game Theory and Applications · Advanced Bandit Algorithms Research
