OLMES: A Standard for Language Model Evaluations

Yuling Gu; Oyvind Tafjord; Bailey Kuehl; Dany Haddad; Jesse Dodge; Hannaneh Hajishirzi

arXiv:2406.08446 · cs.CL · February 12, 2025

TL;DR

OLMES is a comprehensive, open standard designed to improve the reproducibility and consistency of language model evaluations by standardizing various evaluation practices and supporting meaningful comparisons across models.

Contribution

This paper introduces OLMES, a standardized framework for language model evaluation that addresses variability and reproducibility issues in current practices.

Findings

01. OLMES enables consistent evaluation across different models and tasks.

02. It supports meaningful comparisons between models with different input formats.

03. The standard includes documented recommendations based on literature and new experiments.

Abstract

Progress in AI is often demonstrated by new models claiming improved performance on tasks measuring model capabilities. Evaluating language models can be particularly challenging, as choices of how a model is evaluated on a task can lead to large changes in measured performance. There is no common standard setup, so different models are evaluated on the same tasks in different ways, leading to claims about which models perform best not being reproducible. We propose OLMES, a completely documented, practical, open standard for reproducible LLM evaluations. In developing this standard, we identify and review the varying factors in evaluation practices adopted by the community - such as details of prompt formatting, choice of in-context examples, probability normalizations, and task formulation. In particular, OLMES supports meaningful comparisons between smaller base models that require…
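To make the "probability normalizations" factor concrete, here is a minimal sketch of how the choice of normalization can change which answer a model is scored as picking on a multiple-choice task. It is not taken from the paper: the function name `pick_answer`, the scheme labels, the field names, and the log-probability values are all hypothetical and chosen purely for illustration. The four schemes shown (raw sum, per-token, per-character, and subtracting an unconditional log-probability) are ones commonly used in evaluation practice.

```python
def pick_answer(choices, normalization="none"):
    """Return the index of the highest-scoring answer choice.

    Each choice is a dict with (hypothetical) fields:
      text           - the answer string
      token_logprobs - per-token log-probabilities of the answer given the prompt
      uncond_logprob - log-probability of the answer without the question context
    """
    def score(choice):
        total = sum(choice["token_logprobs"])
        if normalization == "none":
            return total  # raw sum of log-probabilities
        if normalization == "token":
            return total / len(choice["token_logprobs"])  # average per token
        if normalization == "character":
            return total / len(choice["text"])  # average per character
        if normalization == "pmi":
            return total - choice["uncond_logprob"]  # subtract unconditional score
        raise ValueError(f"unknown normalization: {normalization}")

    scores = [score(c) for c in choices]
    return max(range(len(choices)), key=scores.__getitem__)


# Made-up numbers for illustration only; not real model outputs.
EXAMPLE_CHOICES = [
    {"text": "yes", "token_logprobs": [-2.0], "uncond_logprob": -1.0},
    {"text": "certainly not", "token_logprobs": [-1.5, -1.5], "uncond_logprob": -4.0},
]

for scheme in ("none", "token", "character", "pmi"):
    print(scheme, pick_answer(EXAMPLE_CHOICES, scheme))
```

With these illustrative numbers the raw sum favors the shorter answer, while the three normalized schemes favor the longer one. The same model outputs can therefore yield different accuracy depending on the scoring convention.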

Taxonomy

Topics: Natural Language Processing Techniques

Methods: Balanced Selection