Clembench: Using Game Play to Evaluate Chat-Optimized Language Models as Conversational Agents
Kranti Chalamalasetti, Jana Götze, Sherzod Hakimov, Brielen Madureira, Philipp Sadler, David Schlangen

TL;DR
This paper introduces Clembench, a framework that uses game-like settings to evaluate the capabilities of chat-optimized language models, demonstrating that newer models perform better in these interactive tasks.
Contribution
It proposes a novel, systematic method for evaluating LLMs through constrained game-like interactions, linking game performance to model development.
Findings
Newer models show improved game-play capabilities.
Performance tracks the model development cycle.
Even the better-performing models are far from saturating the metrics, suggesting the benchmark has diagnostic value.
Abstract
Recent work has proposed a methodology for the systematic evaluation of "Situated Language Understanding Agents" (agents that operate in rich linguistic and non-linguistic contexts) through testing them in carefully constructed interactive settings. Other recent work has argued that Large Language Models (LLMs), if suitably set up, can be understood as (simulators of) such agents. A connection suggests itself, which this paper explores: Can LLMs be evaluated meaningfully by exposing them to constrained game-like settings that are built to challenge specific capabilities? As a proof of concept, this paper investigates five interaction settings, showing that current chat-optimised LLMs are, to an extent, capable of following game-play instructions. Both this capability and the quality of the game play, measured by how well the objectives of the different games are met, follow the development…
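The abstract distinguishes two things: whether a model follows the game-play instructions at all, and how well it meets the game's objective when it does. The sketch below illustrates that split with a toy word-guessing game. It is not the paper's implementation: the game, the reply format, and every name in it (`play_guessing_game`, `EpisodeResult`, `summarize`, the scripted players) are hypothetical, and a real setup would call an LLM where the scripted player stands in.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class EpisodeResult:
    followed_rules: bool  # did every reply respect the required format?
    success: bool         # was the game objective met?


def play_guessing_game(player: Callable[[str], str], target: str,
                       max_turns: int = 3) -> EpisodeResult:
    """Toy game (assumption): replies must look like 'GUESS: <word>'."""
    prefix = "GUESS: "
    prompt = "Guess the secret word. Reply exactly as 'GUESS: <word>'."
    for _ in range(max_turns):
        reply = player(prompt)
        if not reply.startswith(prefix):
            # Rule violation: the episode is aborted, not scored as a loss.
            return EpisodeResult(followed_rules=False, success=False)
        guess = reply[len(prefix):].strip().lower()
        if guess == target:
            return EpisodeResult(followed_rules=True, success=True)
        prompt = f"'{guess}' is wrong. Try again."
    return EpisodeResult(followed_rules=True, success=False)


def summarize(results: List[EpisodeResult]) -> Tuple[float, float]:
    """Return (% of episodes played to the end, % success among those)."""
    played = [r for r in results if r.followed_rules]
    pct_played = 100.0 * len(played) / len(results) if results else 0.0
    quality = (100.0 * sum(r.success for r in played) / len(played)
               if played else 0.0)
    return pct_played, quality


def make_scripted_player(replies: List[str]) -> Callable[[str], str]:
    """Stand-in for a model: returns canned replies in order."""
    it = iter(replies)
    return lambda prompt: next(it)


results = [
    play_guessing_game(make_scripted_player(["GUESS: cat", "GUESS: dog"]), "dog"),
    play_guessing_game(make_scripted_player(["I think it is dog"]), "dog"),
    play_guessing_game(make_scripted_player(["GUESS: a", "GUESS: b", "GUESS: c"]), "dog"),
]
pct_played, quality = summarize(results)
```

Keeping the two numbers separate means a model that ignores the reply format is not confused with one that plays by the rules but plays badly.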
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
