NeurIPS 2025 E2LM Competition : Early Training Evaluation of Language Models

Mouadh Yagoubi; Yasser Dahou; Billel Mokeddem; Younes Belkada; Phuc H. Le-Khac; Basma El Amel Boussaha; Reda Alami; Jingwei Zuo; Damiano Marsili; Mugariya Farooq; Mounia Lalmas; Georgia Gkioxari; Patrick Gallinari; Philip Torr; Hakim Hacid

arXiv:2506.07731·cs.AI·June 10, 2025

TL;DR

This paper introduces a competition focused on developing evaluation methods for assessing the early training progress of language models, addressing the limitations of existing benchmarks during initial training stages.

Contribution

The paper proposes a new challenge: designing evaluation strategies tailored to early training. It provides models and intermediate checkpoints to support research and development.

Findings

01. Evaluation methods vary in effectiveness during early training.

02. New benchmarks can better discriminate model performance early on.

03. Participation is accessible with free cloud GPU resources.

Abstract

Existing benchmarks have proven effective for assessing the performance of fully trained large language models. However, we find striking differences in the early training stages of small models, where benchmarks often fail to provide meaningful or discriminative signals. To explore how these differences arise, this competition tackles the challenge of designing scientific knowledge evaluation tasks specifically tailored for measuring early training progress of language models. Participants are invited to develop novel evaluation methodologies or adapt existing benchmarks to better capture performance differences among language models. To support this effort, we provide three pre-trained small models (0.5B, 1B, and 3B parameters), along with intermediate checkpoints sampled during training up to 200B tokens. All experiments and development work can be run on widely available free…
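As a rough illustration of what a "meaningful signal" across checkpoints could look like, the sketch below scores a benchmark by how consistently its results rise with training tokens, using Spearman rank correlation. This is only one possible proxy and an assumption on our part: the excerpt above does not state the competition's actual scoring criteria. The checkpoint positions (other than the 200B endpoint) and both score series are made-up toy values, and the function names `ranks` and `spearman` are hypothetical helpers.

```python
from typing import Sequence


def ranks(values: Sequence[float]) -> list[float]:
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return 0.0  # A constant series carries no ordering information.
    return cov / (var_x * var_y) ** 0.5


# Hypothetical checkpoint positions, in billions of training tokens.
tokens_b = [25, 50, 100, 150, 200]

# Toy accuracy series, invented for illustration only.
flat_scores = [0.25, 0.26, 0.24, 0.25, 0.26]    # hovers near chance
rising_scores = [0.25, 0.27, 0.30, 0.33, 0.36]  # improves steadily

print("flat benchmark:  ", round(spearman(tokens_b, flat_scores), 3))
print("rising benchmark:", round(spearman(tokens_b, rising_scores), 3))
```

Under this proxy, a benchmark whose scores stay near chance across checkpoints gets a low correlation, while one that tracks training progress gets a value close to 1. A fuller treatment would likely also consider noise across seeds and whether the benchmark separates the three model sizes.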

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Machine Learning in Materials Science · Topic Modeling