Codenames as a Benchmark for Large Language Models

Matthew Stephenson; Matthew Sidji; Benoît Ronval

arXiv:2412.11373·cs.AI·April 23, 2025

Open Access

TL;DR

This paper introduces Codenames as a new benchmark for assessing the reasoning and language understanding of large language models, highlighting their strengths and limitations in gameplay scenarios.

Contribution

It proposes using Codenames as a novel benchmark for evaluating LLM reasoning, and analyzes the performance of various state-of-the-art models in this context.

Findings

01 Certain LLMs outperform others in gameplay

02 Models exhibit different emergent behaviors during play

03 LLM combinations improve generalizability

Abstract

In this paper, we propose the use of the popular word-based board game Codenames as a suitable benchmark for evaluating the reasoning capabilities of Large Language Models (LLMs). Codenames presents a highly interesting challenge for achieving successful AI performance, requiring a sophisticated understanding of language, theory of mind, and epistemic reasoning capabilities. Prior attempts to develop agents for Codenames have largely relied on word embedding techniques, which have a limited vocabulary range and perform poorly when paired with differing approaches. LLMs have demonstrated enhanced reasoning and comprehension capabilities for language-based tasks, but can still struggle with lateral thinking challenges. We evaluate the capabilities of several state-of-the-art LLMs, including GPT-4o, Gemini 1.5, Claude 3.5 Sonnet, and Llama 3.1, across a variety of board setups. Our…

Taxonomy

Topics: Natural Language Processing Techniques

Methods: LLaMA