OLMES: A Standard for Language Model Evaluations
Yuling Gu, Oyvind Tafjord, Bailey Kuehl, Dany Haddad, Jesse Dodge, Hannaneh Hajishirzi

TL;DR
OLMES is a documented, open standard for reproducible language model evaluation. It standardizes evaluation choices such as prompt formatting, in-context examples, probability normalization, and task formulation, so that comparisons across models are meaningful.
Contribution
This paper introduces OLMES, a standardized framework for language model evaluation that addresses variability and reproducibility issues in current practices.
Findings
OLMES enables consistent evaluation across different models and tasks.
It supports meaningful comparisons between models with different input formats.
The standard includes documented recommendations based on literature and new experiments.
Abstract
Progress in AI is often demonstrated by new models claiming improved performance on tasks measuring model capabilities. Evaluating language models can be particularly challenging, as choices of how a model is evaluated on a task can lead to large changes in measured performance. There is no common standard setup, so different models are evaluated on the same tasks in different ways, leading to claims about which models perform best not being reproducible. We propose OLMES, a completely documented, practical, open standard for reproducible LLM evaluations. In developing this standard, we identify and review the varying factors in evaluation practices adopted by the community - such as details of prompt formatting, choice of in-context examples, probability normalizations, and task formulation. In particular, OLMES supports meaningful comparisons between smaller base models that require…
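To illustrate why probability normalization matters, here is a minimal sketch of multiple-choice scoring under four normalization schemes commonly described in the literature (none, per-token, per-character, and PMI-style unconditional normalization). The `Choice` class, the function names, and all log-probability values are hypothetical and made up for illustration; this excerpt does not state which scheme OLMES prescribes for which task.

```python
from dataclasses import dataclass


@dataclass
class Choice:
    text: str              # answer string shown to the model
    logprob: float         # summed token log-probs of the answer given the question (made-up value)
    num_tokens: int        # number of tokens in the answer (made-up value)
    uncond_logprob: float  # log-prob of the answer given a neutral prompt (made-up value)


def score(choice: Choice, normalization: str) -> float:
    """Return the score used to rank one answer choice."""
    if normalization == "none":
        return choice.logprob
    if normalization == "token":
        return choice.logprob / choice.num_tokens
    if normalization == "character":
        return choice.logprob / len(choice.text)
    if normalization == "pmi":
        # Subtract the answer's unconditional log-prob.
        return choice.logprob - choice.uncond_logprob
    raise ValueError(f"unknown normalization: {normalization}")


def predict(choices: list[Choice], normalization: str) -> str:
    """Pick the highest-scoring answer under the given normalization."""
    return max(choices, key=lambda c: score(c, normalization)).text


# Hypothetical example: a short answer versus a longer one.
choices = [
    Choice("yes", logprob=-2.0, num_tokens=1, uncond_logprob=-1.0),
    Choice("a long detailed answer", logprob=-6.0, num_tokens=4, uncond_logprob=-9.0),
]

for scheme in ("none", "token", "character", "pmi"):
    print(scheme, "->", predict(choices, scheme))
```

With these made-up numbers, the unnormalized score prefers the short answer while the other three schemes prefer the long one, so the same model could be marked right or wrong on the same question depending on the setup.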
Code & Models
- 🤗 allenai/DataDecide-falcon-and-cc-qc-tulu-10p-60M (model, 31 downloads, 1 like)
- 🤗 allenai/DataDecide-falcon-and-cc-qc-tulu-10p-90M (model, 27 downloads)
- 🤗 allenai/DataDecide-falcon-and-cc-qc-tulu-10p-20M (model, 24 downloads)
- 🤗 allenai/DataDecide-falcon-and-cc-qc-tulu-10p-4M (model, 9 downloads)
- 🤗 allenai/DataDecide-dclm-baseline-qc-fw-3p-20M (model, 78 downloads)
- 🤗 allenai/DataDecide-dclm-baseline-qc-fw-3p-4M (model, 91 downloads)
- 🤗 allenai/DataDecide-dclm-baseline-qc-fw-3p-90M (model, 85 downloads)
- 🤗 allenai/DataDecide-dclm-baseline-qc-fw-3p-60M (model, 88 downloads)
- 🤗 allenai/DataDecide-falcon-60M (model, 31 downloads)
- 🤗 allenai/DataDecide-falcon-90M (model, 29 downloads)
Taxonomy
Topics: Natural Language Processing Techniques
Methods: Balanced Selection
