Efficacy of Language Model Self-Play in Non-Zero-Sum Games

Austen Liao; Nicholas Tomlin; Dan Klein

arXiv:2406.18872 · cs.CL · December 10, 2024

TL;DR

This paper explores the use of self-play to improve language models in negotiation games, demonstrating significant performance gains and generalization to human collaboration across cooperative and competitive settings.

Contribution

It empirically shows that language models can effectively use self-play to improve in negotiation tasks and generalize to human interactions, even in cooperative scenarios.

Findings

01. Models achieve 14-17x higher task reward after self-play finetuning.

02. Trained models score 2.5-6x higher than base models when collaborating with humans.

03. Self-play improves language model performance in both cooperative and competitive settings.
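The third finding covers objectives that range from cooperative to competitive. In Deal or No Deal (DoND), two players negotiate how to split a shared pool of items. Each player has private per-item values, and neither scores if they fail to agree. The sketch below shows one hypothetical way to move a single game along that spectrum by mixing each player's own score with the opponent's through a weight `lam`. The `DondGame` encoding, the `lam` parameterization, and the example values are illustrative assumptions, not the paper's code or exact reward definitions.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class DondGame:
    """One DoND instance (hypothetical encoding)."""
    counts: Tuple[int, ...]    # how many of each item type are in the shared pool
    values_a: Tuple[int, ...]  # player A's private value per item of each type
    values_b: Tuple[int, ...]  # player B's private value per item of each type


def utility(values: Tuple[int, ...], share: Tuple[int, ...]) -> int:
    """Raw score: sum over item types of (value per item) * (items received)."""
    return sum(v * n for v, n in zip(values, share))


def rewards(game: DondGame, share_a: Optional[Tuple[int, ...]], lam: float) -> Tuple[float, float]:
    """Return (reward_a, reward_b) under the assumed mixing weight `lam`.

    share_a is how many of each item type player A receives, or None if the
    players failed to agree, in which case both rewards are zero.
    lam = 1.0  : fully cooperative (each player's reward is the joint score)
    lam = 0.0  : semi-competitive (each player's reward is only its own score)
    lam = -1.0 : strictly competitive, zero-sum (own score minus opponent's)
    """
    if share_a is None:
        return 0.0, 0.0
    # Player B receives whatever player A does not take.
    share_b = tuple(c - n for c, n in zip(game.counts, share_a))
    u_a = utility(game.values_a, share_a)
    u_b = utility(game.values_b, share_b)
    return u_a + lam * u_b, u_b + lam * u_a


# Example: one hypothetical game and one agreed split, scored under three objectives.
game = DondGame(counts=(1, 2, 3), values_a=(4, 0, 2), values_b=(0, 2, 2))
deal = (1, 0, 3)  # A takes the type-0 item and all three type-2 items; B gets both type-1 items
for lam in (1.0, 0.0, -1.0):
    print(lam, rewards(game, deal, lam))
# prints:
# 1.0 (14.0, 14.0)
# 0.0 (10.0, 4.0)
# -1.0 (6.0, -6.0)
```

The same agreed split is worth the same amount to both players in the cooperative case and gives opposite-signed rewards in the zero-sum case. This is why one negotiation environment can span the whole range.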

Abstract

Game-playing agents like AlphaGo have achieved superhuman performance through self-play, which is theoretically guaranteed to yield optimal policies in competitive games. However, most language tasks are partially or fully cooperative, so it is an open question whether techniques like self-play can effectively be used to improve language models. We empirically investigate this question in a negotiation game setting known as Deal or No Deal (DoND). Crucially, the objective in DoND can be modified to produce a fully cooperative game, a strictly competitive one, or anything in between. We finetune language models in self-play over multiple rounds of filtered behavior cloning in DoND for each of these objectives and evaluate them in self-play and in collaboration with humans. We find that language models improve substantially in self-play, achieving 14-17x higher scores in task reward after…
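The abstract describes finetuning "in self-play over multiple rounds of filtered behavior cloning." A minimal sketch of that loop, under stated assumptions, follows. Each round, the current model plays games against a copy of itself, only the higher-reward dialogues are kept, and the model is finetuned with ordinary supervised learning on those transcripts. The filtering rule (keep the top `keep_fraction` by reward) is a hypothetical choice, and the paper's actual threshold may differ. `generate` and `finetune` are caller-supplied placeholders standing in for LM sampling and supervised finetuning, not a real library API.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

Model = TypeVar("Model")
# (dialogue transcript, reward earned by the side whose turns will be imitated)
Episode = Tuple[str, float]


def filter_episodes(episodes: Sequence[Episode], keep_fraction: float = 0.5) -> List[str]:
    """Keep the transcripts of the highest-reward fraction of episodes.

    Hypothetical rule: rank by reward and keep the top `keep_fraction`
    (always at least one episode when any are available).
    """
    if not episodes:
        return []
    ranked = sorted(episodes, key=lambda ep: ep[1], reverse=True)
    k = max(1, int(len(ranked) * keep_fraction))
    return [dialogue for dialogue, _ in ranked[:k]]


def filtered_behavior_cloning(
    model: Model,
    generate: Callable[[Model, int], Sequence[Episode]],
    finetune: Callable[[Model, List[str]], Model],
    rounds: int,
    games_per_round: int,
    keep_fraction: float = 0.5,
) -> Model:
    """Run `rounds` iterations of: self-play, filter by reward, then finetune."""
    for _ in range(rounds):
        episodes = generate(model, games_per_round)  # model plays a copy of itself
        kept = filter_episodes(episodes, keep_fraction)
        model = finetune(model, kept)  # supervised training on kept transcripts
    return model


# Toy stand-ins: the "model" is just a count of finetuning rounds completed so far.
def toy_generate(model: int, n: int) -> List[Episode]:
    return [(f"round{model}-game{i}", float(i)) for i in range(n)]


def toy_finetune(model: int, transcripts: List[str]) -> int:
    return model + 1


final = filtered_behavior_cloning(0, toy_generate, toy_finetune, rounds=3, games_per_round=4)
```

With a real model, `generate` would sample both sides of each DoND dialogue and score the outcome. One reasonable design choice is to treat the two players' turns as separate episodes, each labeled with that player's own reward, so that filtering keeps the behavior of whichever side did well.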

Peer Reviews

Decision · Submitted to ICLR 2025

Reviewer 01 · Rating 5 · Confidence 3

Strengths

I really enjoy the research direction and perspective of the authors. Agentic LLMs are becoming more popular and their interactions in non-cooperative environments are understudied, so this work takes a novel and interesting approach. The authors do well to demonstrate where their self-play is effective, especially the result that it learns to reduce hallucinations. They also provide interesting analysis of human-AI play and different ways in which training affects results. There are a lot of int…

Weaknesses

The major weakness of this paper is the lack of explicit understanding of its own game dynamics, which are crucial for understanding expected behaviour, baselines on performance, and the conclusions that can be drawn. This is likely because of missing insights from previous, non-LLM work. The overall story of the paper is also muddled, so it is unclear what the main insights are and whether the experiments support the conclusions. I believe this is fixable in the review period and give my rec…

Reviewer 02 · Rating 6 · Confidence 4

Strengths

- A key innovation of this work is the self-play training on large language models, which I believe is a novel approach. While self-play has been successfully used in other domains with smaller models, applying it to LLMs in a negotiation game is interesting. This approach expands the scope of self-play, establishing it as a promising technique for training LLMs in complex, dialogue-based tasks.
- The experimental results demonstrate substantial improvements, with the performance of the models i…

Weaknesses

- The study exclusively focuses on language models and does not incorporate any reinforcement learning (RL) baselines for comparison. Including RL-based models could have provided a broader benchmark, highlighting the unique contributions of self-play for LLMs while also revealing potential strengths or weaknesses relative to established RL techniques.
- The self-play data generation and subsequent finetuning require substantial computational resources, resulting in high training costs. The auth…

Reviewer 03 · Rating 5 · Confidence 4

Strengths

1. The paper provides extensive quantitative results demonstrating significant performance improvements, and the results show that models trained with self-play perform better in collaboration with humans.
2. Includes detailed analysis of errors, agreement rates, and Pareto optimality.
3. Provides detailed insights, such as analyses of dialogue length and hallucination rates.
4. The paper is clearly presented and easy to follow.

Weaknesses

1. In cooperative games involving humans, performance declines as the number of rounds increases from 8 to 10. Why does this occur? This trend is not seen in other game settings.
2. The experiments and analysis are limited to a single game, which could introduce bias. Evaluations across additional environments are needed for validation.
3. The study only considers cooperative and semi-competitive settings. Including a wider range of competitive and cooperative levels could provide deeper insight…

Code & Models

Repositories

nickatomlin/lm-selfplay (Official)

Taxonomy

Topics: Language and cultural evolution · Natural Language Processing Techniques · Multi-Agent Systems and Negotiation

Methods: Balanced Selection