Clembench: Using Game Play to Evaluate Chat-Optimized Language Models as Conversational Agents

Kranti Chalamalasetti; Jana Götze; Sherzod Hakimov; Brielen Madureira; Philipp Sadler; David Schlangen

arXiv:2305.13455 · cs.CL · November 27, 2023 · 2 cites

TL;DR

This paper introduces Clembench, a framework that uses game-like settings to evaluate the capabilities of chat-optimized language models, demonstrating that newer models perform better in these interactive tasks.

Contribution

It proposes a novel, systematic method for evaluating LLMs through constrained game-like interactions, linking game performance to model development.

Findings

01 Newer models show improved game-play capabilities.

02 Performance tracks the model development cycle.

03 Metrics are far from saturated even for the better-performing models, indicating continued diagnostic value.

Abstract

Recent work has proposed a methodology for the systematic evaluation of "Situated Language Understanding Agents" (agents that operate in rich linguistic and non-linguistic contexts) through testing them in carefully constructed interactive settings. Other recent work has argued that Large Language Models (LLMs), if suitably set up, can be understood as (simulators of) such agents. A connection suggests itself, which this paper explores: Can LLMs be evaluated meaningfully by exposing them to constrained game-like settings that are built to challenge specific capabilities? As a proof of concept, this paper investigates five interaction settings, showing that current chat-optimised LLMs are, to an extent, capable of following game-play instructions. Both this capability and the quality of the game play, measured by how well the objectives of the different games are met, follow the development…
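
The abstract separates two things: whether a model can follow game-play instructions at all, and how well it meets the game's objective when it does. The following is a minimal sketch of that separation, not the clembench API or the paper's actual games or metrics. The word-guessing game, the `GUESS: <word>` reply format, the `100 / turn` score, and all names (`run_episode`, `summarize`, `EpisodeResult`) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class EpisodeResult:
    played: bool    # True if the player followed the format rules throughout
    quality: float  # 0..100; only meaningful when played is True


def run_episode(player: Callable[[str], str], target: str,
                max_turns: int = 3) -> EpisodeResult:
    """Hypothetical word-guessing game run by a scripted game master.

    The player must answer every prompt as 'GUESS: <word>'.
    """
    prefix = "GUESS: "
    prompt = "Guess the word. Reply exactly as 'GUESS: <word>'."
    for turn in range(1, max_turns + 1):
        reply = player(prompt)
        if not reply.startswith(prefix):
            # Rule violation: the episode is aborted and not scored.
            return EpisodeResult(played=False, quality=0.0)
        guess = reply[len(prefix):].strip().lower()
        if guess == target:
            # Assumed scoring: earlier success earns a higher score.
            return EpisodeResult(played=True, quality=100.0 / turn)
        prompt = f"'{guess}' is wrong. Try again as 'GUESS: <word>'."
    # Rules were followed, but the objective was not met.
    return EpisodeResult(played=True, quality=0.0)


def summarize(results: List[EpisodeResult]) -> Tuple[float, float]:
    """Return (% of episodes played to the end, mean quality of those)."""
    played = [r for r in results if r.played]
    pct_played = 100.0 * len(played) / len(results) if results else 0.0
    mean_quality = sum(r.quality for r in played) / len(played) if played else 0.0
    return pct_played, mean_quality
```

Here `player` stands in for a call to a chat model; any function that maps a prompt string to a reply string works. Keeping the two numbers apart means a model that ignores the reply format is counted as not having played, rather than as having played badly.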

Code & Models

Repositories

clp-research/clembench (Official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification