The Automated LLM Speedrunning Benchmark: Reproducing NanoGPT Improvements

Bingchen Zhao; Despoina Magka; Minqi Jiang; Xian Li; Roberta Raileanu; Tatiana Shavrina; Jean-Christophe Gagnon-Audet; Kelvin Niu; Shagun Sodhani; Michael Shvartsman; Andrei Lupu; Alisia Lupidi; Edan Toledo; Karen Hambardzumyan; Martin Josifoski; Thomas Foster; Lucia Cipolina-Kun; Abhishek Charnalia; Derek Dunfield; Alexander H. Miller; Oisin Mac Aodha; Jakob Foerster; Yoram Bachrach

arXiv:2506.22419 · cs.AI · July 2, 2025

TL;DR

This paper introduces an automated benchmark to evaluate AI agents' ability to reproduce and improve LLM training results, highlighting current limitations of reasoning LLMs in replicating known innovations.

Contribution

The paper presents the Automated LLM Speedrunning Benchmark, a realistic and accessible test for AI agents to reproduce and optimize LLM training improvements.

Findings

01. Recent reasoning LLMs struggle to reimplement known innovations

02. The benchmark covers diverse code-level improvements

03. It provides a measure of AI's ability to automate scientific reproduction

Abstract

Rapid advancements in large language models (LLMs) have the potential to assist in scientific progress. A critical capability toward this endeavor is the ability to reproduce existing work. To evaluate the ability of AI agents to reproduce results in an active research area, we introduce the Automated LLM Speedrunning Benchmark, leveraging the research community's contributions to the NanoGPT speedrun, a competition to train a GPT-2 model in the shortest time. Each of the 19 speedrun tasks provides the agent with the previous record's training script, optionally paired with one of three hint formats, ranging from pseudocode to paper-like descriptions of the new record's improvements. Records execute quickly by design, and speedrun improvements encompass diverse code-level changes, ranging from high-level algorithmic advancements to hardware-aware optimizations. These features make the…

Code & Models

Repositories

facebookresearch/llm-speedrunner · PyTorch · Official

Videos

The Automated LLM Speedrunning Benchmark: Reproducing NanoGPT Improvements · SlidesLive

Taxonomy

Topics: Machine Learning in Materials Science · Artificial Intelligence in Healthcare and Education · Scientific Computing and Data Management

Methods: Dropout · GPT-2 · Hierarchical Information Threading