The Decrypto Benchmark for Multi-Agent Reasoning and Theory of Mind

Andrei Lupu; Timon Willi; Jakob Foerster

arXiv:2506.20664·cs.AI·June 26, 2025

TL;DR

Decrypto is a novel interactive benchmark inspired by cognitive science, designed to evaluate multi-agent reasoning and theory of mind abilities in large language models, revealing current models' limitations compared to humans.

Contribution

It introduces the first interactive platform for ToM experiments, addressing gaps in existing benchmarks and providing comprehensive evaluation methods for LLMs in multi-agent scenarios.

Findings

01. LLMs lag behind humans in game-playing abilities.

02. State-of-the-art reasoning models perform worse than older models on ToM tasks.

03. Decrypto reveals significant gaps in current LLMs' multi-agent reasoning and ToM skills.

Abstract

As Large Language Models (LLMs) gain agentic abilities, they will have to navigate complex multi-agent scenarios, interacting with human users and other agents in cooperative and competitive settings. This will require new reasoning skills, chief amongst them being theory of mind (ToM), or the ability to reason about the "mental" states of other agents. However, ToM and other multi-agent abilities in LLMs are poorly understood, since existing benchmarks suffer from narrow scope, data leakage, saturation, and lack of interactivity. We thus propose Decrypto, a game-based benchmark for multi-agent reasoning and ToM drawing inspiration from cognitive science, computational pragmatics and multi-agent reinforcement learning. It is designed to be as easy as possible in all other dimensions, eliminating confounding factors commonly found in other benchmarks. To our knowledge, it is also the…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01·Rating 4·Confidence 4

Strengths

- an interesting ToM setting with a clean, language-only, interactive testbed for multi-agent reasoning and ToM that avoids many common confounds
- broad evaluations across model families/sizes; cooperative and competitive regimes; human-AI cross-play; prompt variants; also two adapted experiments (RC/FB and PT) provide diagnostic granularity

Weaknesses

- section 4 feels overlong and under-integrated with ToM claims. Much of it re-states known concepts (zero-shot, OOD) without showing how these choices sharpen or test ToM hypotheses. Lines 190–209 are especially verbose, and the assertions that “specialists can overfit DECRYPTO” are plausible but not verified in this setting; also, how do authors ensure some foundation models do not see DECRYPTO in their training?
- the linkage between word-association mismatch and ToM is unclear. The paper att

Reviewer 02·Rating 6·Confidence 3

Strengths

1. The paper studies a missing point in prior static or text-based benchmarks. By adapting the board game Decrypto, it provides a clear, interpretable, and engaging framework that tests pragmatic inference, cooperation, and competition among agents.
2. The benchmark is carefully designed to remove confounding factors such as mathematical, spatial, or symbolic reasoning, focusing purely on language-based reasoning and perspective-taking.
3. The paper is validated on extensive experiments involv

Weaknesses

While DECRYPTO is elegantly designed, it remains an artificial language game that may not fully capture the complexity or ambiguity of real-world multi-agent communication. The constrained, turn-based structure and reliance on predefined keywords could limit its ecological validity and applicability to open-ended human interactions.

Reviewer 03·Rating 4·Confidence 3

Strengths

- This work picks DECRYPTO, which is simple, to clearly investigate LLMs' performance on theory of mind.
- Extensive experiments and analyses are conducted to examine LLMs' performance in competition, coordination, and ToM, yielding useful conclusions.

Weaknesses

- The paper claims to be the first benchmark to design interactive ToM experiments. In fact, some related works may have explored this topic, though not very comprehensively, e.g. MAgIC: Investigation of Large Language Model Powered Multi-Agent in Cognition, Adaptability, Rationality and Collaboration.
- The fixed roles (Encoder, Decoder, Interceptor) might limit the exploration of richer multi-agent dynamics (e.g., negotiation, deception, coalition).

Code & Models

Repositories

facebookresearch/decrypto·Official

Taxonomy

Topics: Chaos-based Image/Signal Encryption