Triangulating LLM Progress through Benchmarks, Games, and Cognitive Tests

Filippo Momentè; Alessandro Suglia; Mario Giulianelli; Ambra Ferrari; Alexander Koller; Oliver Lemon; David Schlangen; Raquel Fernández; Raffaella Bernardi

arXiv:2502.14359·cs.CL·September 25, 2025

TL;DR

This paper compares benchmarks, games, and cognitive tests as ways of evaluating LLMs. It finds that interactive games discriminate between models of varying quality better than standard benchmarks do, and it compiles a suite of targeted cognitive tests inspired by human assessments.

Contribution

It introduces a comprehensive evaluation framework combining benchmarks, games, and cognitive tests, highlighting the effectiveness of interactive games and proposing new targeted assessments for LLMs.

Findings

01 Interactive games discriminate between models better than standard benchmarks

02 Causal and logical reasoning correlate with both static and interactive tests

03 Core executive functions and social/emotional skills correlate more with games

Abstract

We examine three evaluation paradigms: standard benchmarks (e.g., MMLU and BBH), interactive games (e.g., Signalling Games or Taboo), and cognitive tests (e.g., for working memory or theory of mind). First, we investigate which of the former two (benchmarks or games) is more effective at discriminating LLMs of varying quality. Then, inspired by human cognitive assessments, we compile a suite of targeted tests that measure cognitive abilities deemed essential for effective language use, and we investigate their correlation with model performance in benchmarks and games. Our analyses reveal that interactive games are superior to standard benchmarks in discriminating models. Causal and logical reasoning correlate with both static and interactive tests, while differences emerge regarding core executive functions and social/emotional skills, which correlate more with games. We advocate for the…
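The correlation analysis mentioned in the abstract can be pictured with a minimal sketch. It assumes a Pearson coefficient computed over one score per model; the abstract does not state which coefficient the authors use, and the function name `pearson` and all numbers below are hypothetical placeholders, not results from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of per-model scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


# Placeholder values for illustration only (NOT taken from the paper):
# each position corresponds to one hypothetical model.
cognitive_test_scores = [0.2, 0.4, 0.6, 0.8]
game_scores = [0.1, 0.3, 0.5, 0.7]

r = pearson(cognitive_test_scores, game_scores)
print(f"correlation between cognitive test and game scores: {r:.2f}")
```

A coefficient near 1 would indicate that models ranking high on a cognitive test also tend to rank high on the game; a rank-based coefficient such as Spearman's could be substituted if only the ordering of models matters.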

Taxonomy

Topics: Text Readability and Simplification · Natural Language Processing Techniques