Re-evaluating Open-ended Evaluation of Large Language Models

Siqi Liu; Ian Gemp; Luke Marris; Georgios Piliouras; Nicolas Heess; Marc Lanctot

arXiv:2502.20170·cs.GT·May 9, 2025

TL;DR

This paper critiques current Elo-based open-ended evaluation methods for large language models, proposing a game-theoretic approach to improve robustness against data biases and redundancies, thereby offering more reliable model assessments.

Contribution

It introduces a novel 3-player game framework for LLM evaluation, addressing biases in existing rating systems with new solution concepts for enhanced robustness.

Findings

01 The proposed method yields more intuitive and reliable ratings.

02 It reveals insights into the competitive landscape of LLMs.

03 The approach mitigates bias amplification in model evaluation.

Abstract

Evaluation has traditionally focused on ranking candidates for a specific skill. Modern generalist models, such as Large Language Models (LLMs), decidedly outpace this paradigm. Open-ended evaluation systems, where candidate models are compared on user-submitted prompts, have emerged as a popular solution. Despite their many advantages, we show that the current Elo-based rating systems can be susceptible to and even reinforce biases in data, intentional or accidental, due to their sensitivity to redundancies. To address this issue, we propose evaluation as a 3-player game, and introduce novel game-theoretic solution concepts to ensure robustness to redundancy. We show that our method leads to intuitive ratings and provide insights into the competitive landscape of LLM development.
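The redundancy sensitivity described in the abstract can be illustrated with a toy sketch. This is not the authors' experiment or their proposed method: the model names, prompt names, outcomes, and Elo parameters (`k`, `epochs`, `base`) below are all hypothetical, and the update rule is the standard online Elo formula. The only point is that repeating one prompt, without adding any new information, can flip an Elo ranking.

```python
def elo_ratings(matches, k=16.0, epochs=200, base=1000.0):
    """Standard online Elo over (winner, loser) pairs, cycled for several epochs."""
    ratings = {}
    for _ in range(epochs):
        for winner, loser in matches:
            rw = ratings.setdefault(winner, base)
            rl = ratings.setdefault(loser, base)
            # Probability the winner was expected to win, given current ratings.
            expected_w = 1.0 / (1.0 + 10.0 ** ((rl - rw) / 400.0))
            delta = k * (1.0 - expected_w)
            ratings[winner] = rw + delta
            ratings[loser] = rl - delta
    return ratings


# Hypothetical outcomes: model "A" wins on two distinct prompts, "B" wins on one.
outcomes_by_prompt = {
    "math": ("A", "B"),
    "code": ("A", "B"),
    "poem": ("B", "A"),
}


def matches_from(prompt_counts):
    """Expand {prompt: number of copies} into a flat list of (winner, loser) pairs."""
    matches = []
    for prompt, count in prompt_counts.items():
        matches.extend([outcomes_by_prompt[prompt]] * count)
    return matches


# Each prompt appears once: A wins the majority of distinct prompts.
balanced = elo_ratings(matches_from({"math": 1, "code": 1, "poem": 1}))

# Same three prompts, but "poem" is submitted five times (pure redundancy).
redundant = elo_ratings(matches_from({"math": 1, "code": 1, "poem": 5}))

a_leads_balanced = balanced["A"] > balanced["B"]
a_leads_redundant = redundant["A"] > redundant["B"]
print("A leads with unique prompts:   ", a_leads_balanced)
print("A leads with duplicated prompt:", a_leads_redundant)
```

In this sketch the set of distinct prompts never changes, yet the ranking depends on how often each prompt is submitted. That dependence is the kind of redundancy sensitivity the abstract says the proposed game-theoretic ratings are designed to avoid.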

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 8 · Confidence 4

Strengths

* The study of LLM evaluation addresses a promising and urgent need. The proposed method appears to be a reasonable approach for focusing on separability among LLMs, especially for those fine-tuned from the same foundational model.
* Including the condition of prompts in the rating evaluation is a straightforward approach, and I believe this direction will soon gain more popularity.
* The overall narrative is promising, supported by sufficient experiments that demonstrate the robustness of the e

Weaknesses

* The main idea is based on the fact that the importance of prompts in average cases may not be balanced; however, the proposed solution may lead to exploitation by niche prompts if LLMs share similar general abilities and some major prompts are marked as redundant. This approach lacks a mechanism to verify prompt redundancy. For a more formal evaluation, it would be better to provide some information on redundancy rather than directly applying the inferred results.
* I found the use case of the

Reviewer 02 · Rating 6 · Confidence 2

Strengths

+ This paper cites prior work abundantly and explains well how the proposed work relates to it.
+ There are several interesting, important results. Figure 2, specifically, provides a very clear indication of the utility of the proposed equilibrium ranking approach.

Weaknesses

- I found the paper a bit difficult to read. The paper is somewhat outside my exact expertise, which may explain the difficulty. However, if other reviewers also had trouble following the details of the methodology section, I would recommend expanding that section with additional intuitive explanations and expanding some of the descriptions in the appendix.
- Figure captions could be improved by concluding with important takeaways.

Reviewer 03 · Rating 6 · Confidence 4

Strengths

This paper is good. Firstly, from a technical perspective, it introduces a 3-player game model based on game theory, providing equilibrium solutions applicable to N-player games. This is highly innovative in my opinion. Moreover, the paper conducts an analysis based on this paradigm, utilizing Affinity Entropy to meet assumptions and thereby addressing the limitations of traditional Elo evaluation methods in handling redundant data and biases. Secondly, there is a very close integration betwee

Weaknesses

There are a few points that confuse me:
1. In the background section, the introduction to NE and CCE seems somewhat limited, although this is a focal point of the paper. Insufficient explanation might lead to misunderstandings upon re-reading. LLE is also an important concept, yet it is only mentioned in section 3.2, causing some logical confusion.
2. In the paper, what specific meaning does "p" represent in the context of Affinity Entropy? Is it reasonable to directly set p=1 in Theorem 1? The

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification