Codenames as a Benchmark for Large Language Models
Matthew Stephenson, Matthew Sidji, Benoît Ronval

TL;DR
This paper introduces Codenames as a new benchmark for assessing the reasoning and language understanding of large language models, highlighting their strengths and limitations in gameplay scenarios.
Contribution
It proposes using Codenames as a novel benchmark for evaluating LLM reasoning, and analyzes the performance of various state-of-the-art models in this context.
Findings
Certain LLMs outperform others in gameplay
Models exhibit different emergent behaviors during play
LLM combinations improve generalizability
Abstract
In this paper, we propose the use of the popular word-based board game Codenames as a suitable benchmark for evaluating the reasoning capabilities of Large Language Models (LLMs). Codenames presents a highly interesting challenge for achieving successful AI performance, requiring a sophisticated understanding of language, theory of mind, and epistemic reasoning capabilities. Prior attempts to develop agents for Codenames have largely relied on word embedding techniques, which have a limited vocabulary range and perform poorly when paired with differing approaches. LLMs have demonstrated enhanced reasoning and comprehension capabilities for language-based tasks, but can still suffer in lateral thinking challenges. We evaluate the capabilities of several state-of-the-art LLMs, including GPT-4o, Gemini 1.5, Claude 3.5 Sonnet, and Llama 3.1, across a variety of board setups. Our…
Taxonomy
Topics: Natural Language Processing Techniques
Methods: LLaMA
