Welfare Diplomacy: Benchmarking Language Model Cooperation
Gabriel Mukobi, Hannah Erlebach, Niklas Lauffer, Lewis Hammond, Alan Chan, Jesse Clifton

TL;DR
This paper introduces Welfare Diplomacy, a new benchmark game for evaluating AI cooperation in which players must balance military conquest against domestic welfare, enabling better assessment and training of cooperative multi-agent systems.
Contribution
It proposes Welfare Diplomacy as a novel benchmark, implements it with open-source tools, and evaluates baseline language model agents on social welfare and exploitability.
Findings
High social welfare achieved by state-of-the-art models
Baseline agents are exploitable despite high welfare
Welfare Diplomacy provides clearer cooperation assessment
Abstract
The growing capabilities and increasingly widespread deployment of AI systems necessitate robust benchmarks for measuring their cooperative capabilities. Unfortunately, most multi-agent benchmarks are either zero-sum or purely cooperative, providing limited opportunities for such measurements. We introduce a general-sum variant of the zero-sum board game Diplomacy -- called Welfare Diplomacy -- in which players must balance investing in military conquest and domestic welfare. We argue that Welfare Diplomacy facilitates both a clearer assessment of and stronger training incentives for cooperative capabilities. Our contributions are: (1) proposing the Welfare Diplomacy rules and implementing them via an open-source Diplomacy engine; (2) constructing baseline agents using zero-shot prompted language models; and (3) conducting experiments where we find that baselines using state-of-the-art…
Peer Reviews
Decision·Submitted to ICLR 2024
Overall I think this is a great paper. Its strengths can be summarized as follows: (1) The paper is clearly written and easy to follow. (2) The paper proposes a new environment variant to benchmark agents' cooperative ability and clearly illustrates the motivation. (3) The paper offers a theoretical analysis of the proposed environment and verifies that its design is reasonable. (4) The experiments successfully benchmark agents' cooperative ability.
The weaknesses are summarized as follows: (1) The authors could include more experimental results and ablation studies, such as prompt sensitivity and hyperparameter effects. (2) The authors should try to incorporate mixed human-LLM experiments to see how human engagement influences LLM performance. (3) Some human analysis of the LLMs' policies should be conducted to better understand their performance.
1. The game of diplomacy is an important challenge in multi-agent research, and the concept of welfare diplomacy is interesting. 2. The paper effectively explains the differences between the proposed game and existing benchmarks. By making two modifications to the game rules, the nature of the game has been altered, incentivizing players to pursue peace and promoting cooperation. 3. The proposed game and prompts are open-sourced, and experimental results are extensive.
1. Some arguments regarding the motivations of welfare diplomacy lack rigor and may be questionable. It has been repeatedly claimed in the paper that "While Standard Diplomacy (SD) has features that make it interesting as an environment for cooperative AI research, it is zero-sum and incentivizes the development of cooperation-undermining capabilities" and "In contrast to SD, WD is general-sum". However, it has been pointed out in [1] that "In Diplomacy, seven players... coordinate their action
1. The authors introduce Welfare Diplomacy (WD) and provide an implementation in an open-source Diplomacy library. 2. This paper provides theoretical and empirical evidence highlighting the benefits of WD compared to the existing zero-sum benchmark, Standard Diplomacy (SD). 3. The authors develop a language model (LM) scaffolding system to create competent zero-shot baseline agents for WD.
1. Pareto-efficient equilibria are often not stable, and various factors can lead to deviations from the equilibrium, such as imperfect information, externalities, or strategic behavior. These deviations can disrupt the equilibrium and lead to a new outcome that is not Pareto-efficient. 2. It is challenging to attain Pareto-efficient equilibria, and how to achieve optimal Nash welfare remains unclear.
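The reviews above refer to Pareto efficiency and Nash welfare without defining them. As a minimal sketch, and not taken from the paper's code, the Python snippet below illustrates the standard textbook definitions, assuming each player's outcome is summarized by a single non-negative utility. The function names and the example utility vectors are hypothetical and chosen only for illustration; the paper may use a normalized variant of these quantities.

```python
import math
from typing import Sequence


def utilitarian_welfare(utilities: Sequence[float]) -> float:
    """Sum of all players' utilities."""
    return sum(utilities)


def nash_welfare(utilities: Sequence[float]) -> float:
    """Product of all players' utilities (assumed non-negative)."""
    if any(u < 0 for u in utilities):
        raise ValueError("Nash welfare assumes non-negative utilities")
    return math.prod(utilities)


def pareto_dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if outcome a is at least as good as b for every player
    and strictly better for at least one player."""
    if len(a) != len(b):
        raise ValueError("outcomes must cover the same players")
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


# Hypothetical three-player outcomes with the same total utility.
balanced = [4, 4, 4]
lopsided = [10, 1, 1]

print(utilitarian_welfare(balanced), utilitarian_welfare(lopsided))
print(nash_welfare(balanced), nash_welfare(lopsided))
print(pareto_dominates(balanced, lopsided))
```

In this toy comparison the two outcomes have equal utilitarian welfare, but the balanced one has higher Nash welfare, because the product penalizes outcomes in which some players receive very little. Neither outcome Pareto-dominates the other, since the first player does better under the lopsided outcome.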
Taxonomy
Topics: Multi-Agent Systems and Negotiation · Economic Policies and Impacts · Reinforcement Learning in Robotics
