Learning Robust Social Strategies with Large Language Models
Dereck Piche, Mohammed Muqeeth, Milad Aghajohari, Juan Duque, Michael Noukhovitch, Aaron Courville

TL;DR
This paper investigates multi-agent social dilemmas with large language models, demonstrating that standard reinforcement learning leads to self-interested behavior, and proposes an opponent-aware training method to promote cooperation and robustness.
Contribution
It introduces Advantage Alignment, an opponent-learning awareness algorithm adapted for LLMs, and a new social dilemma environment, Trust-and-Split, to improve multi-agent cooperation.
Findings
Advantage Alignment increases collective payoffs in social dilemmas.
RL-trained LLMs tend to develop opportunistic, exploitative behaviors.
The proposed method enhances robustness against greedy exploitation.
Abstract
As agentic AI becomes more widespread, agents with distinct and possibly conflicting goals will interact in complex ways. These multi-agent interactions pose a fundamental challenge, particularly in social dilemmas, where agents' individual incentives can undermine collective welfare. While reinforcement learning (RL) has been effective for aligning large language models (LLMs) in the single-agent regime, prior small-network results suggest that standard RL in multi-agent settings often converges to defecting, self-interested policies. We show the same effect in LLMs: despite cooperative priors, RL-trained LLM agents develop opportunistic behavior that can exploit even advanced closed-source models. To address this tendency of RL to converge to poor equilibria, we adapt a recent opponent-learning awareness algorithm, Advantage Alignment, to fine-tune LLMs toward multi-agent cooperation…
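
To make the notion of a social dilemma concrete, the following minimal sketch uses the textbook Prisoner's Dilemma. The payoff values and the helper names `best_response` and `welfare` are illustrative assumptions. They are not taken from the paper or from its Trust-and-Split environment.

```python
# Payoffs (row player, column player) for a classic Prisoner's Dilemma.
# "C" = cooperate, "D" = defect. The numbers are illustrative assumptions,
# NOT values from the paper.
PAYOFFS = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}


def best_response(opponent_action: str) -> str:
    """Return the action that maximizes the row player's own payoff
    against a fixed opponent action."""
    return max("CD", key=lambda a: PAYOFFS[(a, opponent_action)][0])


def welfare(a1: str, a2: str) -> int:
    """Collective payoff: the sum of both players' payoffs."""
    return sum(PAYOFFS[(a1, a2)])


# Defecting is the individually best reply to either opponent action...
print(best_response("C"), best_response("D"))
# ...yet mutual defection yields lower collective welfare than mutual cooperation.
print(welfare("C", "C"), welfare("D", "D"))
```

Under these assumed payoffs, each agent's individually optimal choice is to defect, while the pair as a whole does better when both cooperate. This is the tension the abstract describes.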
Peer Reviews
Decision · Submitted to ICLR 2026
- The paper targets a concrete and important failure mode in multi-agent LLM training: standard RL drives initially cooperative models toward greedy, Pareto-suboptimal policies. Analyzing this effect across several models establishes a meaningful risk for agentic LLM deployment.
- The observation that standard RL induces greedy behavior across multiple open-source LLMs is consistently demonstrated.
- The group-relative baseline provides a stable advantage estimator without requiring a critic. T

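
As a rough illustration of a critic-free, group-relative baseline, the sketch below subtracts the mean reward of a group of sampled trajectories from each trajectory's reward. This is an assumed, generic form, similar in spirit to GRPO-style estimators. The paper's exact estimator, including any normalization, is not given in this excerpt, and `group_relative_advantages` is a hypothetical name.

```python
from statistics import mean
from typing import List


def group_relative_advantages(rewards: List[float]) -> List[float]:
    """Estimate advantages by subtracting the group's mean reward.

    Assumption: every reward in `rewards` comes from a trajectory sampled
    under the same conditions, so the group mean can act as a baseline
    in place of a learned critic.
    """
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


advantages = group_relative_advantages([1.0, 2.0, 3.0])
print(advantages)
```

Because the baseline is computed from the samples themselves, the advantages in a group sum to zero and no separate value network has to be trained.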
- The modification from $\sum_{k<t}$ to $\sum_{k \le t}$ in the AA term (Eq. 2) is motivated by partial observability. The original AA structure follows from causality: an agent’s action at time $t$ can only affect an opponent’s future behavior. Including $k=t$ implicitly assumes immediate influence on the opponent’s current action, which departs from the standard formulation.
- Constrained decoding may confound agent behavior. Several settings enforce regex-constrained decoding to ensure valid

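
The index change from $\sum_{k<t}$ to $\sum_{k \le t}$ raised in this review can be illustrated with a small sketch. It assumes only that the AA term contains a discounted sum of the agent's own advantages. The full form of Eq. 2 is not reproduced in this excerpt, and the function name, the discount `gamma`, and the weighting `gamma**(t-k)` are hypothetical choices made for illustration.

```python
from typing import List


def discounted_past_sum(adv: List[float], gamma: float, inclusive: bool) -> List[float]:
    """For each step t, return the sum of gamma**(t-k) * adv[k].

    inclusive=False sums over k < t  (past steps only).
    inclusive=True  sums over k <= t (past steps plus the current step).
    """
    out = []
    running = 0.0  # invariant: running == sum_{k<t} gamma**(t-k) * adv[k]
    for a in adv:
        out.append(running + a if inclusive else running)
        running = gamma * (running + a)
    return out


adv = [1.0, 2.0, 3.0]
print(discounted_past_sum(adv, gamma=0.5, inclusive=False))
print(discounted_past_sum(adv, gamma=0.5, inclusive=True))
```

In this sketch the two variants differ at every step by exactly the current advantage `adv[t]`. In particular, the exclusive sum is zero at the first step, while the inclusive sum is not, which is the point the review questions on causality grounds.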
- As far as I know, this is the first application of opponent shaping to LM agents.
- The paper builds on SOTA methods for opponent shaping.
- Aside from a few points (see below), the paper is easy to follow.
- The testbeds used are similar to existing results with traditional RL agents, allowing direct comparison.

- The paper primarily replicates existing results on traditional RL agent training dynamics, except that it uses pretrained LMs as agents instead of randomly initialized feedforward networks.
- The main method of the paper, jit-AA, is based on a very minor variation of an existing algorithm. Moreover, it’s not made clear why this variation is made, what benefits it enables over traditional AA, or why it performs better empirically. With experiments in just two very simple settings, the claims that

1. The paper builds a novel testbed for social dilemmas and finds that, despite having cooperative priors, current LLM agents still fail to adopt strategies that act in the collective interest. This highlights that current LLMs are not yet prepared to operate robustly in real-world multi-agent settings and points out a novel risk in current agentic AI. The proposed benchmark is meaningful and valuable.
2. The authors identify a hidden assumption in the original Advantage Alignment algorithm: "ag

1. The proposed jit-Advantage Alignment algorithm seems to offer little more than the original Advantage Alignment algorithm, except for considering the agent's advantage at the current time step. Therefore, its innovation feels somewhat limited.
2. The experiments mostly use models with fewer than 8B parameters or leverage weaker models to guide stronger models. While these weaker models may be tractable for isolated tasks, their priors might not be sufficient to consider collective welfare. Ho
Taxonomy
Topics: Reinforcement Learning in Robotics · Topic Modeling · Explainable Artificial Intelligence (XAI)
