Progress over Points: Reframing LM Benchmarks Around Scientific Objectives

Alwin Jin; Sean M. Hendryx; Vaskar Nath

arXiv:2512.11183 · cs.LG · December 15, 2025

Open Access

TL;DR

This paper proposes progress-oriented benchmarks for language models, in which the benchmark objective is itself a target of scientific progress, and instantiates one as a NanoGPT speedrun environment so that benchmark gains reflect meaningful advances in the field.

Contribution

The paper introduces a benchmark framework centered on scientific progress, standardizes the evaluation tooling, and reports improved training efficiency along with emergent algorithmic ideas.

Findings

01. Set a new state-of-the-art training time on the NanoGPT speedrun, 3 seconds faster than the previous best.

02. Observed novel algorithmic ideas emerging during benchmarking.

03. Promotes a shift from static leaderboards to measuring scientific progress.

Abstract

Current benchmarks that test LLMs on static, already-solved problems (e.g., math word problems) have effectively demonstrated basic capability acquisition. The natural progression has been toward larger, more comprehensive, and more challenging collections of static problems, an approach that inadvertently constrains the kinds of advances we can measure and incentivize. To address this limitation, we argue for progress-oriented benchmarks: problem environments whose objectives are themselves the core targets of scientific progress, so that achieving state of the art on the benchmark advances the field. As an introductory step, we instantiate an environment based on the NanoGPT speedrun. The environment standardizes a dataset slice, a reference model and training harness, and rich telemetry, with run-time verification and anti-gaming checks. Evaluation centers on the scientific delta achieved:…
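
The abstract is cut off before it defines the "scientific delta", so the following is only a minimal sketch of one way such a score could be computed for a speedrun-style environment. It assumes the delta is wall-clock time saved relative to a baseline run, and that a run counts only if it reaches a target validation loss. The `RunResult` type, the `scientific_delta` function, the `target_val_loss` parameter, and all numbers are hypothetical illustrations, not the paper's harness or results.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RunResult:
    """Telemetry summary for one training run (hypothetical fields)."""
    wall_clock_seconds: float
    final_val_loss: float


def scientific_delta(baseline: RunResult, candidate: RunResult,
                     target_val_loss: float) -> Optional[float]:
    """Return seconds saved versus the baseline, or None if the run is invalid.

    Assumption: a candidate only counts if it reaches the target validation
    loss. This single check stands in for the environment's verification
    and anti-gaming checks, which the abstract does not spell out.
    """
    if candidate.final_val_loss > target_val_loss:
        return None
    return baseline.wall_clock_seconds - candidate.wall_clock_seconds


# Made-up numbers, for illustration only.
baseline = RunResult(wall_clock_seconds=180.0, final_val_loss=3.28)
faster = RunResult(wall_clock_seconds=177.0, final_val_loss=3.27)
too_lossy = RunResult(wall_clock_seconds=170.0, final_val_loss=3.40)

print(scientific_delta(baseline, faster, target_val_loss=3.28))     # 3.0
print(scientific_delta(baseline, too_lossy, target_val_loss=3.28))  # None
```

Returning `None` for a run that misses the loss target, rather than a negative or zero score, keeps invalid runs from being ranked alongside valid ones; a real environment would presumably apply stricter checks than this single threshold.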

Taxonomy

Topics: Machine Learning in Materials Science · Topic Modeling · Natural Language Processing Techniques