Multi-agent cooperation through learning-aware policy gradients

Alexander Meulemans; Seijin Kobayashi; Johannes von Oswald; Nino Scherrer; Eric Elmoznino; Blake Richards; Guillaume Lajoie; Blaise Agüera y Arcas; João Sacramento

arXiv:2410.18636·cs.AI·March 20, 2025

TL;DR

This paper introduces a novel unbiased policy gradient method for multi-agent reinforcement learning that enables self-interested agents to learn cooperative behaviors by modeling each other's learning dynamics, leading to improved cooperation in social dilemmas.

Contribution

It presents the first unbiased, higher-derivative-free policy gradient algorithm for learning-aware multi-agent reinforcement learning, incorporating long observation histories for better cooperation.

Findings

1. Achieves cooperative behavior in standard social dilemmas
2. Demonstrates high returns in environments requiring action coordination
3. Provides a new explanation for cooperation emergence among learning agents

Abstract

Self-interested individuals often fail to cooperate, posing a fundamental challenge for multi-agent learning. How can we achieve cooperation among self-interested, independent learning agents? Promising recent work has shown that in certain tasks cooperation can be established between learning-aware agents who model the learning dynamics of each other. Here, we present the first unbiased, higher-derivative-free policy gradient algorithm for learning-aware reinforcement learning, which takes into account that other agents are themselves learning through trial and error based on multiple noisy trials. We then leverage efficient sequence models to condition behavior on long observation histories that contain traces of the learning dynamics of other agents. Training long-context policies with our algorithm leads to cooperative behavior and high returns on standard social dilemmas, including…
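As a minimal illustration of why self-interested agents fail to cooperate, the sketch below encodes a one-shot prisoner's dilemma. The payoff values (5, 3, 1, 0) are the conventional textbook choice and are assumed here, not taken from the paper; the function names are hypothetical. It shows only the dilemma itself, not the paper's algorithm.

```python
# Illustrative one-shot prisoner's dilemma.
# Payoff values are assumed (textbook T=5 > R=3 > P=1 > S=0), not from the paper.
C, D = 0, 1  # action indices: cooperate, defect

# PAYOFF[my_action][other_action] -> my reward
PAYOFF = [
    [3, 0],  # I cooperate: R if the other cooperates, S if the other defects
    [5, 1],  # I defect:    T if the other cooperates, P if the other defects
]


def best_response(other_action: int) -> int:
    """Action that maximizes my own one-shot payoff against a fixed opponent action."""
    return max((C, D), key=lambda a: PAYOFF[a][other_action])


def joint_return(a1: int, a2: int) -> int:
    """Sum of both players' payoffs for a joint action."""
    return PAYOFF[a1][a2] + PAYOFF[a2][a1]


if __name__ == "__main__":
    # Defection is the best response to either opponent action ...
    print(best_response(C), best_response(D))
    # ... yet mutual cooperation gives a higher joint return than mutual defection.
    print(joint_return(C, C), joint_return(D, D))
```

Under these assumed payoffs, defecting is individually optimal whatever the other player does, while mutual cooperation leaves both players better off than mutual defection. This is the tension that the learning-aware approach described in the abstract sets out to resolve.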

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

1. The paper is well-written and well-organized.
2. Theoretical proofs are complete and sound.

Weaknesses

1. The motivation for proposing COALA-PG is unclear. It’s not obvious whether the issue is related to variance or other issues when using mini-batches. The authors suggest that larger mini-batches could pose a problem, but this may lead to higher variance in reward summation. However, these points are not extensively discussed in the manuscript. Additionally, compared to M-FOS, it appears that COALA-PG uses the 1/B term to scale rewards, but it seems that this scaling is still related to control

Reviewer 02 · Rating 5 · Confidence 4

Strengths

1. **History-Dependent Adaptation for Multi-Agent Cooperation:** The paper introduces a promising approach that enables agents to adaptively cooperate by conditioning policy updates on observation histories. This allows agents to respond to non-stationarity in general-sum games, specifically handling the evolving distributions of co-agent strategies as each agent independently learns and adapts over time. By incorporating these historical observations, the framework aims to maintain effective co

Weaknesses

1. **Full History Dependency:** While history dependency enables adaptability, it may approximate full observability, especially in discrete environments. By accumulating state-action information over time, agents in discrete settings could essentially reconstruct the environment as if fully observable, reducing the framework's applicability in scenarios where true partial observability is intended.
2. **Simplistic Experimental Scenarios:** The chosen experiments, such as the Iterated Prisoner’s

Reviewer 03 · Rating 8 · Confidence 2

Strengths

1. Clear writing and clear concept definitions.
2. Extensive theoretical comparison with prior work, and experiments on two non-trivial settings.
3. Sufficient implementation details in the appendix.

Weaknesses

This may be difficult, but it would be great if you could show the efficiency of your method on more difficult deep MARL environments beyond matrix games or grid worlds (such as Agar.io [1]).

[1] Discovering Diverse Multi-Agent Strategic Behavior via Reward Randomization, ICLR 2021.

Videos

Multi-agent cooperation through learning-aware policy gradients · SlidesLive

Taxonomy

Topics: Complex Systems and Decision Making · Reinforcement Learning in Robotics · Game Theory and Applications