clembench-2024: A Challenging, Dynamic, Complementary, Multilingual Benchmark and Underlying Flexible Framework for LLMs as Multi-Action Agents

Anne Beyer; Kranti Chalamalasetti; Sherzod Hakimov; Brielen Madureira; Philipp Sadler; David Schlangen

arXiv:2405.20859·cs.CL·June 3, 2024

Open Access

TL;DR

This paper introduces clembench-2024, a flexible, multilingual benchmark framework for evaluating Large Language Models as multi-action agents through dynamic, interactive game-like environments, addressing current evaluation limitations.

Contribution

It presents a new adaptable framework for testing LLMs in interactive, multilingual settings, capable of evolving with new developments and avoiding data contamination.

Findings

1. Models perform below human levels in the benchmark.

2. The framework can adapt to new evaluation needs.

3. Prompting language significantly impacts model performance.

Abstract

It has been established in recent work that Large Language Models (LLMs) can be prompted to "self-play" conversational games that probe certain capabilities (general instruction following, strategic goal orientation, language understanding abilities), where the resulting interactive game play can be automatically scored. In this paper, we take one of the proposed frameworks for setting up such game-play environments, and further test its usefulness as an evaluation instrument, along a number of dimensions: We show that it can easily keep up with new developments while avoiding data contamination, we show that the tests implemented within it are not yet saturated (human performance is substantially higher than that of even the best models), and we show that it lends itself to investigating additional questions, such as the impact of the prompting language on performance. We believe that…
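
The idea of automatically scored self-play described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not the clembench API: the word-guessing game, the `CLUE:`/`GUESS:` move format, the `play` and `score` functions, and the scripted players that stand in for LLM calls are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical stand-in for an LLM: maps the dialogue history to the next utterance.
Player = Callable[[List[str]], str]


@dataclass
class Episode:
    target: str
    max_turns: int = 3
    transcript: List[str] = field(default_factory=list)
    aborted: bool = False  # a player violated the move format or the game rules
    success: bool = False  # the guesser named the target word


def play(describer: Player, guesser: Player, target: str, max_turns: int = 3) -> Episode:
    """Run one game: the describer gives clues, the guesser tries to name the target."""
    ep = Episode(target=target, max_turns=max_turns)
    for _ in range(max_turns):
        clue = describer(ep.transcript)
        # Rule check: clues must be tagged and must not leak the target word.
        if not clue.startswith("CLUE:") or target.lower() in clue.lower():
            ep.aborted = True
            break
        ep.transcript.append(clue)

        guess = guesser(ep.transcript)
        if not guess.startswith("GUESS:"):
            ep.aborted = True
            break
        ep.transcript.append(guess)

        if guess[len("GUESS:"):].strip().lower() == target.lower():
            ep.success = True
            break
    return ep


def score(ep: Episode) -> float:
    """Toy score: 0 for aborted or failed games, otherwise higher for fewer turns."""
    if ep.aborted or not ep.success:
        return 0.0
    turns_used = len(ep.transcript) // 2
    return 100.0 * (ep.max_turns - turns_used + 1) / ep.max_turns


# Scripted players replace real model calls so the sketch runs offline.
def scripted_describer(history: List[str]) -> str:
    clues = ["CLUE: a yellow fruit", "CLUE: monkeys like it"]
    return clues[min(len(history) // 2, len(clues) - 1)]


def scripted_guesser(history: List[str]) -> str:
    guesses = ["GUESS: lemon", "GUESS: banana"]
    return guesses[min(len(history) // 2, len(guesses) - 1)]


ep = play(scripted_describer, scripted_guesser, "banana")
print(ep.transcript, score(ep))
```

In this sketch the rule check serves two purposes: a game that is aborted for a format violation says something about instruction following, while the score of a completed game says something about how well the task was solved. Replacing the scripted players with calls to actual models would leave the loop and the scoring unchanged.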

Taxonomy

Topics: Natural Language Processing Techniques