Elo Uncovered: Robustness and Best Practices in Language Model Evaluation

Meriem Boubdir; Edward Kim; Beyza Ermis; Sara Hooker; Marzieh Fadaee

arXiv:2311.17295·cs.CL·November 30, 2023·5 cites

TL;DR

This paper critically examines the use of the Elo rating system for evaluating large language models, highlighting its limitations in reliability and transitivity, and proposes guidelines to improve evaluation robustness.

Contribution

It provides an in-depth analysis of Elo's behavior in LLM evaluation, revealing volatility issues and offering practical recommendations for more reliable assessment methods.

Findings

01. Elo scores can be volatile and inconsistent.

02. Varying Elo hyperparameters affects evaluation reliability.

03. Current Elo-based evaluations may not satisfy key axioms.

Abstract

In Natural Language Processing (NLP), the Elo rating system, originally designed for ranking players in dynamic games such as chess, is increasingly being used to evaluate Large Language Models (LLMs) through "A vs B" paired comparisons. However, while popular, the system's suitability for assessing entities with constant skill levels, such as LLMs, remains relatively unexplored. We study two fundamental axioms that evaluation methods should adhere to: reliability and transitivity. We conduct extensive evaluation of Elo behaviour, illustrating that individual Elo computations exhibit volatility and delving into the impact of varying the Elo rating system's hyperparameters. We show that these axioms are not always satisfied, raising questions about the reliability of current comparative evaluations of LLMs. If the current use of Elo scores is intended to substitute the costly head-to-head…
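To make the volatility claim concrete, here is a minimal sketch of the standard Elo update applied to "A vs B" comparisons. It is not the authors' code: the function names, the K-factor of 16, the initial rating of 1000, and the 13-to-7 win record are all illustrative assumptions. The sketch replays the same set of outcomes in different orders and reports how far the final rating moves.

```python
import random


def expected_score(r_a, r_b):
    """Standard Elo expected score of A against B (base 10, scale 400)."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def run_elo(matches, k=16.0, initial=1000.0):
    """Replay (a, b, score_a) matches in order and return the final ratings.

    score_a is 1.0 if A wins and 0.0 if B wins. The values of k and
    initial are assumptions for illustration, not taken from the paper.
    """
    ratings = {}
    for a, b, score_a in matches:
        r_a = ratings.setdefault(a, initial)
        r_b = ratings.setdefault(b, initial)
        e_a = expected_score(r_a, r_b)
        # Zero-sum update: what A gains, B loses.
        ratings[a] = r_a + k * (score_a - e_a)
        ratings[b] = r_b - k * (score_a - e_a)
    return ratings


# Hypothetical record: model "A" wins 13 of 20 comparisons against "B".
matches = [("A", "B", 1.0)] * 13 + [("A", "B", 0.0)] * 7

# Replay the same outcomes in 100 random orders.
rng = random.Random(0)
finals = []
for _ in range(100):
    shuffled = matches[:]
    rng.shuffle(shuffled)
    finals.append(run_elo(shuffled)["A"])

spread = max(finals) - min(finals)
print(f"A's final rating varies by {spread:.2f} points across orderings")
```

Because each update depends on the ratings at that moment, the same win record can yield different final scores depending on match order, and a larger `k` amplifies the effect. Averaging over many permutations, as in the loop above, is one way to reduce that sensitivity.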

Videos

Elo Uncovered: Robustness and Best Practices in Language Model Evaluation · SlidesLive

Taxonomy

Topics: Sports Analytics and Performance · Topic Modeling · Explainable Artificial Intelligence (XAI)