Democratizing Diplomacy: A Harness for Evaluating Any Large Language Model on Full-Press Diplomacy
Alexander Duffy, Samuel J Paech, Ishana Shastri, Elizabeth Karpinski, Baptiste Alloui-Cros, Tyler Marques, Matthew Lyle Olson

TL;DR
This paper introduces an evaluation harness that allows any large language model to play full-press Diplomacy without fine-tuning, enabling broad assessment of strategic reasoning capabilities in LLMs.
Contribution
The work presents a novel, data-driven approach to representing game states as text for LLMs, enabling out-of-the-box play of Diplomacy and providing tools for hypothesis testing and statistical analysis.
Findings
Among the LLMs evaluated, larger models perform best at full-press Diplomacy.
Smaller models still demonstrate adequate gameplay.
Critical State Analysis enables rapid, in-depth examination of key game moments.
Abstract
We present the first evaluation harness that enables any out-of-the-box, local Large Language Model (LLM) to play full-press Diplomacy without fine-tuning or specialized training. Previous work required frontier LLMs or fine-tuning because of the high complexity and information density of Diplomacy's game state. Combined with the high variance of matches, these factors made Diplomacy prohibitively difficult to study. In this work, we used data-driven iteration to optimize a textual game state representation such that a 24B model can reliably complete matches without any fine-tuning. We develop tooling to facilitate hypothesis testing and statistical analysis, and we present case studies on persuasion, aggressive playstyles, and performance across a range of models. We conduct a variety of experiments across many popular LLMs, finding that the larger models perform the best, but the smaller models…
