A Third Paradigm for LLM Evaluation: Dialogue Game-Based Evaluation using clembench

David Schlangen; Sherzod Hakimov; Chalamalasetti Kranti; Jonathan Jordan; Philipp Sadler

arXiv:2507.08491·cs.CL·February 27, 2026

TL;DR

This paper introduces clembench, a benchmark framework for dialogue game-based evaluation of large language models (LLMs). The approach combines control over what is tested with ecological validity and goal-directedness.

Contribution

The paper presents a mature, reusable implementation of dialogue game-based evaluation that supports benchmarking LLMs and extending the benchmark with custom tests.

Findings

01 Clembench provides a standardized platform for dialogue game evaluation.

02 It allows easy benchmarking of models with pre-defined game instances.

03 The framework supports extension with new, targeted evaluation tests.

Abstract

There are currently two main paradigms for evaluating large language models (LLMs): reference-based evaluation and preference-based evaluation. The first, carried over from the evaluation of machine learning models in general, relies on pre-defined task instances for which reference task executions are available. The second, best exemplified by the LM-arena, relies on (often self-selected) users bringing their own intents to a site that routes these to several models in parallel; the user then selects the response they prefer most. The former paradigm hence excels at control over what is tested, while the latter comes with higher ecological validity, testing actual use cases interactively. Recently, a third complementary paradigm has emerged that combines some of the strengths of these approaches, offering control over multi-turn, reference-free, repeatable…
