Efficacy of Language Model Self-Play in Non-Zero-Sum Games
Austen Liao, Nicholas Tomlin, Dan Klein

TL;DR
This paper explores the use of self-play to improve language models in negotiation games, demonstrating significant performance gains and generalization to human collaboration across cooperative and competitive settings.
Contribution
It empirically shows that language models can effectively use self-play to improve in negotiation tasks and generalize to human interactions, even in cooperative scenarios.
Findings
Models improve 14-17x in task reward after self-play finetuning.
Trained models outperform base models 2.5-6x in human collaboration.
Self-play enhances language model performance in both cooperative and competitive settings.
Abstract
Game-playing agents like AlphaGo have achieved superhuman performance through self-play, which is theoretically guaranteed to yield optimal policies in competitive games. However, most language tasks are partially or fully cooperative, so it is an open question whether techniques like self-play can effectively be used to improve language models. We empirically investigate this question in a negotiation game setting known as Deal or No Deal (DoND). Crucially, the objective in DoND can be modified to produce a fully cooperative game, a strictly competitive one, or anything in between. We finetune language models in self-play over multiple rounds of filtered behavior cloning in DoND for each of these objectives and evaluate them in self-play and in collaboration with humans. We find that language models improve substantially in self-play, achieving 14-17x higher scores in task reward after…
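As a rough illustration of the setup the abstract describes, the sketch below shows one way a single DoND objective could be tuned between cooperative and competitive play, and how rounds of filtered behavior cloning might be organized. Everything here is an assumption made for illustration rather than the authors' implementation: the names `Game`, `reward`, `select_games`, and `filtered_bc`, the mixing weight `lam`, the above-average filtering rule, and the `play` and `finetune` callables, which stand in for self-play generation and model finetuning.

```python
from dataclasses import dataclass
from typing import Callable, List, TypeVar

Model = TypeVar("Model")


@dataclass
class Game:
    """One finished self-play negotiation (hypothetical container)."""
    dialogue: str
    score_a: float  # points earned by player A
    score_b: float  # points earned by player B


def reward(own: float, other: float, lam: float) -> float:
    """Assumed objective: own score plus lam times the partner's score.

    lam = 1 is fully cooperative (sum of scores), lam = -1 is strictly
    competitive (difference of scores), and values in between interpolate.
    """
    if not -1.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [-1, 1]")
    return own + lam * other


def select_games(games: List[Game], lam: float) -> List[Game]:
    """Keep games whose reward for player A is above the batch mean.

    The above-average cutoff is an assumed filtering rule.
    """
    if not games:
        return []
    rewards = [reward(g.score_a, g.score_b, lam) for g in games]
    cutoff = sum(rewards) / len(rewards)
    return [g for g, r in zip(games, rewards) if r > cutoff]


def filtered_bc(
    model: Model,
    play: Callable[[Model], Game],
    finetune: Callable[[Model, List[Game]], Model],
    lam: float,
    rounds: int,
    games_per_round: int,
) -> Model:
    """Alternate self-play generation, filtering, and finetuning."""
    for _ in range(rounds):
        games = [play(model) for _ in range(games_per_round)]
        kept = select_games(games, lam)
        model = finetune(model, kept)  # imitate only the retained games
    return model
```

Under this assumed parameterization, a split where a player earns 5 points and the partner earns 3 is worth 8 in the cooperative setting and 2 in the strictly competitive one, so the same dialogue can be retained under one objective and filtered out under another.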
Peer Reviews
Decision·Submitted to ICLR 2025
I really enjoy the research direction and perspective of the authors. Agentic LLMs are becoming more popular and their interactions in non-cooperative environments are understudied, so this work has a novel and interesting approach. The authors do well to demonstrate where their self-play is effective, especially the result that it learns to reduce hallucinations. They also provide interesting analysis of human-AI play and different ways in which training affects results. There are a lot of int
The major weakness of this paper is the lack of explicit understanding of their own game dynamics, which are crucial for understanding expected behaviour, baselines on performance, and conclusions that can be drawn. This is likely because of missing insights from previous, non-LLM work. The overall story of the paper is also muddled, so it is unclear what the main insights are and whether the experiments support the conclusions. I believe this is fixable in the review period and give my rec
- A key innovation of this work is self-play training on large language models, which I believe is a novel approach. While self-play has been successfully used in other domains with smaller models, applying it to LLMs in a negotiation game is interesting. This expands the scope of self-play, making it a promising technique for training LLMs in complex, dialogue-based tasks.
- The experimental results demonstrate substantial improvements, with the performance of the models i
- The study exclusively focuses on language models and does not incorporate any reinforcement learning (RL) baselines for comparison. Including RL-based models could have provided a broader benchmark, highlighting the unique contributions of self-play for LLMs while also revealing potential strengths or weaknesses relative to established RL techniques.
- The self-play data generation and subsequent finetuning require substantial computational resources, resulting in high training costs. The auth
1. The paper provides extensive quantitative results demonstrating significant performance improvements, and the results show that models trained with self-play perform better in collaboration with humans.
2. Includes detailed analysis of errors, agreement rates, and Pareto optimality.
3. Provides detailed insights, such as analyses of dialogue length and hallucination rates.
4. The paper is clearly presented and easy to follow.
1. In cooperative games involving humans, performance declines as the number of rounds increases from 8 to 10. Why does this occur? This trend is not seen in other game settings.
2. The experiments and analysis are limited to a single game, which could introduce bias. Evaluations across additional environments are needed for validation.
3. The study only considers cooperative and semi-competitive settings. Including a wider range of competitive and cooperative levels could provide deeper insight
Taxonomy
Topics: Language and cultural evolution · Natural Language Processing Techniques · Multi-Agent Systems and Negotiation
Methods: Balanced Selection
