Multi-agent cooperation through learning-aware policy gradients
Alexander Meulemans, Seijin Kobayashi, Johannes von Oswald, Nino Scherrer, Eric Elmoznino, Blake Richards, Guillaume Lajoie, Blaise Agüera y Arcas, João Sacramento

TL;DR
This paper introduces a novel unbiased policy gradient method for multi-agent reinforcement learning that enables self-interested agents to learn cooperative behaviors by modeling each other's learning dynamics, leading to improved cooperation in social dilemmas.
Contribution
It presents the first unbiased, higher-derivative-free policy gradient algorithm for learning-aware multi-agent reinforcement learning, incorporating long observation histories for better cooperation.
Findings
Achieves cooperative behavior in standard social dilemmas
Demonstrates high returns in environments requiring action coordination
Provides a new explanation for cooperation emergence among learning agents
Abstract
Self-interested individuals often fail to cooperate, posing a fundamental challenge for multi-agent learning. How can we achieve cooperation among self-interested, independent learning agents? Promising recent work has shown that in certain tasks cooperation can be established between learning-aware agents who model the learning dynamics of each other. Here, we present the first unbiased, higher-derivative-free policy gradient algorithm for learning-aware reinforcement learning, which takes into account that other agents are themselves learning through trial and error based on multiple noisy trials. We then leverage efficient sequence models to condition behavior on long observation histories that contain traces of the learning dynamics of other agents. Training long-context policies with our algorithm leads to cooperative behavior and high returns on standard social dilemmas, including…
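To make the notion of a social dilemma concrete, here is a minimal sketch of a one-shot Prisoner's Dilemma. The payoff values (3, 0, 5, 1) are the conventional textbook ones and are an assumption, not taken from the paper; the helper names `best_response` and `joint_return` are hypothetical. The sketch only illustrates why self-interested agents tend to defect even though mutual cooperation pays more in total. It is not the paper's algorithm.

```python
# Illustrative one-shot Prisoner's Dilemma (assumed textbook payoffs, not from the paper).
# PAYOFF[(own_action, other_action)] is the reward to the agent playing own_action.
C, D = "C", "D"
PAYOFF = {
    (C, C): 3,  # mutual cooperation
    (C, D): 0,  # cooperator is exploited
    (D, C): 5,  # defector exploits a cooperator
    (D, D): 1,  # mutual defection
}


def best_response(other_action):
    """Return the action that maximizes one's own payoff against a fixed opponent action."""
    return max((C, D), key=lambda own: PAYOFF[(own, other_action)])


def joint_return(action_1, action_2):
    """Sum of both agents' payoffs for a joint action."""
    return PAYOFF[(action_1, action_2)] + PAYOFF[(action_2, action_1)]


if __name__ == "__main__":
    # Defection is the best response to either opponent action...
    print(best_response(C), best_response(D))
    # ...yet mutual cooperation yields a higher joint return than mutual defection.
    print(joint_return(C, C), joint_return(D, D))
```

Under these assumed payoffs, defection is individually optimal whatever the other agent does, while the joint return is highest when both cooperate. That gap is the dilemma that learning-aware agents are meant to overcome.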
Peer Reviews
Decision · ICLR 2025 Poster
1. The paper is well-written and well-organized. 2. Theoretical proofs are complete and sound.
1. The motivation for proposing COALA-PG is unclear. It’s not obvious whether the issue is related to variance or other issues when using mini-batches. The authors suggest that larger mini-batches could pose a problem, but this may lead to higher variance in reward summation. However, these points are not extensively discussed in the manuscript. Additionally, compared to M-FOS, it appears that COALA-PG uses the 1/B term to scale rewards, but it seems that this scaling is still related to control
1. **History-Dependent Adaptation for Multi-Agent Cooperation:** The paper introduces a promising approach that enables agents to adaptively cooperate by conditioning policy updates on observation histories. This allows agents to respond to non-stationarity in general-sum games, specifically handling the evolving distributions of co-agent strategies as each agent independently learns and adapts over time. By incorporating these historical observations, the framework aims to maintain effective co
1. **Full History Dependency:** While history dependency enables adaptability, it may approximate full observability, especially in discrete environments. By accumulating state-action information over time, agents in discrete settings could essentially reconstruct the environment as if fully observable, reducing the framework’s applicability in scenarios where true partial observability is intended. 2. **Simplistic Experimental Scenarios:** The chosen experiments, such as the Iterated Prisoner’s
1. Clear writing, clear concept definition. 2. Extensive theoretical comparison with prior works and experiments on two non-trivial settings. 3. Enough details of implementation in the appendix.
This may be difficult, but it would be great if you could show the efficiency of your method on more difficult deep MARL environments beyond matrix games or grid worlds (like Agar.io [1]). [1]: Discovering Diverse Multi-Agent Strategic Behavior via Reward Randomization, ICLR 2021
Taxonomy
Topics: Complex Systems and Decision Making · Reinforcement Learning in Robotics · Game Theory and Applications
