Sparks of Cooperative Reasoning: LLMs as Strategic Hanabi Agents
Mahesh Ramesh, Kaousheik Jayakumar, Aswinkumar Ramkumar, Pavan Thodima, Aniket Rege, Emmanouil-Vasileios Vlatakis-Gkaragkounis

TL;DR
This paper evaluates large language models as strategic agents in Hanabi. The models can maintain internal working memory, and new datasets and finetuning improve their cooperative play, yet they still lag behind humans and specialized agents.
Contribution
It introduces new datasets, benchmarks, and finetuning methods for LLMs in Hanabi, advancing understanding of their cooperative reasoning capabilities.
Findings
LLMs can maintain internal working memory for state tracking.
Cross-play performance improves smoothly with model strength.
Finetuning significantly enhances cooperative Hanabi performance.
Abstract
Cooperative reasoning under incomplete information remains challenging for both humans and multi-agent systems. The card game Hanabi embodies this challenge, requiring theory-of-mind reasoning and strategic communication. We benchmark 17 state-of-the-art LLM agents in 2-5 player games and study the impact of context engineering across model scales (4B to 600B+) to understand persistent coordination failures and robustness to scaffolding: from a minimal prompt with only explicit card details (Watson setting), to scaffolding with programmatic, Bayesian-motivated deductions (Sherlock setting), to multi-turn state tracking via working memory (Mycroft setting). We show that (1) agents can maintain an internal working memory for state tracking and (2) cross-play performance between different LLMs smoothly interpolates with model strength. In the Sherlock setting, the strongest reasoning…
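The abstract describes the Sherlock setting as scaffolding with programmatic, Bayesian-motivated deductions. As a rough illustration of what one such deduction could look like (an assumption for this summary, not the authors' implementation), the Python sketch below counts the unseen cards of the standard 50-card Hanabi deck and turns received hints into a distribution over a single hidden card's identity. The function names, the hint encoding, and the color letters are hypothetical.

```python
from collections import Counter
from fractions import Fraction

# Standard Hanabi deck: 5 colors, each with three 1s, two 2s/3s/4s, one 5.
COLORS = ("R", "Y", "G", "W", "B")  # hypothetical color letters
RANK_COUNTS = {1: 3, 2: 2, 3: 2, 4: 2, 5: 1}


def full_deck():
    """Counter mapping (color, rank) -> number of copies in the deck."""
    return Counter({(c, r): n for c in COLORS for r, n in RANK_COUNTS.items()})


def card_posterior(visible, color_hint=None, rank_hint=None,
                   not_colors=(), not_ranks=()):
    """Distribution over the identity of one hidden card.

    visible:    iterable of (color, rank) cards the player can see
                (partners' hands, discard pile, played fireworks).
    color_hint, rank_hint: positive hints received for this card.
    not_colors, not_ranks: values ruled out by hints that skipped this card.
    """
    # Counter subtraction keeps only cards with a positive remaining count.
    remaining = full_deck() - Counter(visible)
    candidates = {
        card: n for card, n in remaining.items()
        if (color_hint is None or card[0] == color_hint)
        and (rank_hint is None or card[1] == rank_hint)
        and card[0] not in not_colors
        and card[1] not in not_ranks
    }
    total = sum(candidates.values())
    if total == 0:
        raise ValueError("hints and visible cards are inconsistent")
    # Weight each candidate identity by how many unseen copies remain.
    return {card: Fraction(n, total) for card, n in candidates.items()}


if __name__ == "__main__":
    seen = [("R", 1), ("R", 1), ("G", 1)]
    for card, p in sorted(card_posterior(seen, rank_hint=1).items()):
        print(card, p)
```

Exact fractions are used so the probabilities sum to one without rounding. A fuller scaffold would also have to account for the player's other hidden cards, which this single-card sketch ignores.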
Peer Reviews
Decision: Submitted to ICLR 2026
1. This paper is well-written and well-organized.
2. The paper provides a comprehensive benchmark, which includes several structured context designs, and the dataset could support future studies.
1. The analysis in this paper remains superficial: it reports scores without in-depth analysis.
2. The entire work is constrained to a single game domain and does not show how the benchmark generalizes to real-world settings.
3. There is no mention of sampling temperature, top-k/top-p values, or seed control.
4. It is unclear whether models play as independent agents or as a shared model controlling all players sequentially.
1. The authors are transparent about the metrics they report and the hyperparameters they use.
2. The plots are mostly well-formatted and easy to read.
3. The paper follows a logical flow and is easy to follow.
4. The experiments seem reasonable.
5. Evaluating LLMs on coordination and cooperation tasks is highly significant for the community, and reporting trustworthy results on a popular benchmark is very useful. Furthermore, fine-tuning data can also be highly impactful and useful.
## Clarity
1. Citation formatting is inconsistent.
2. Sometimes citations are missing, like for BAD and SAD.
3. The writing tense changes.
## Quality
1. Overall, while the authors provide the standard deviations, they do not seem to account for them when making conclusions. It appears that most standard deviations heavily overlap, and it's unclear whether any of the conclusions hold up under a statistical significance test. Especially with such high variance, the authors might want to consider u
**Scale and Scope:** The evaluation is comprehensive, covering a wide range of recent LLMs (17 models), multiple player counts (2-5), and includes cross-play experiments. This provides a robust snapshot of current LLM capabilities in Hanabi.
**Dataset Contributions:** HanabiLogs and HanabiRewards are significant contributions, providing richly annotated data with move utilities, which can fuel future research in instruction tuning and reinforcement learning for cooperative agents. The successfu
**Scaffolding Choices:** While insightful, the performance heavily depends on the chosen scaffolding (MinCon, DeductCon). It remains unclear why these specific strategies are chosen, and makes readers wonder about generalization to settings without such explicit support, different context, unseen partners and open-ended conventions.
**Implicit State Tracking Analysis:** The Mycroft setting highlights difficulties with implicit state tracking, but the analysis doesn't quantify the source of erro
Taxonomy
Topics: Artificial Intelligence in Games · Reinforcement Learning in Robotics · Explainable Artificial Intelligence (XAI)
