AlgoTune: Can Language Models Speed Up General-Purpose Numerical Programs?

Ori Press; Brandon Amos; Haoyu Zhao; Yikai Wu; Samuel K. Ainsworth; Dominik Krupke; Patrick Kidger; Touqir Sajed; Bartolomeo Stellato; Jisun Park; Nathanael Bosch; Eli Meril; Albert Steppi; Arman Zharmagambetov; Fangzhao Zhang; David Perez-Pineiro; Alberto Mercurio; Ni Zhan; Talor Abramovich; Kilian Lieret; Hanlin Zhang; Shirley Huang; Matthias Bethge; Ofir Press

arXiv:2507.15887 · cs.SE · October 27, 2025

TL;DR

This paper introduces AlgoTune, a benchmark and framework for testing language models' ability to write efficient code for computationally challenging problems in computer science, physics, and mathematics. The baseline agent achieves modest speedups but shows little algorithmic innovation.

Contribution

It presents AlgoTune, a new benchmark of 154 tasks, together with AlgoTuner, a baseline LM agent, to evaluate and improve language models' ability to generate efficient algorithmic code.

Findings

01. AlgoTuner achieves an average 1.72x speedup over reference solvers.

02. Current models tend to optimize surface-level features rather than discover new algorithms.

03. The benchmark encourages development of more creative and effective algorithmic solutions.

Abstract

Despite progress in language model (LM) capabilities, evaluations have thus far focused on models' performance on tasks that humans have previously solved, including in programming (Jimenez et al., 2024) and mathematics (Glazer et al., 2024). We therefore propose testing models' ability to design and implement algorithms in an open-ended benchmark: We task LMs with writing code that efficiently solves computationally challenging problems in computer science, physics, and mathematics. Our AlgoTune benchmark consists of 154 coding tasks collected from domain experts and a framework for validating and timing LM-synthesized solution code, which is compared to reference implementations from popular open-source packages. In addition, we develop a baseline LM agent, AlgoTuner, and evaluate its performance across a suite of frontier models. AlgoTuner uses a simple, budgeted loop that edits…
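The validate-then-time idea described in the abstract can be illustrated with a small sketch. This is not the AlgoTune framework's API: the names `best_time`, `speedup`, `insertion_sort`, `is_sorted_copy`, and `PROBLEMS` are hypothetical, the toy task (sorting) is chosen only for illustration, and the aggregation used here (total reference time divided by total candidate time) is an assumption rather than the benchmark's scoring rule.

```python
import random
import time
from typing import Any, Callable, Optional, Sequence

Solver = Callable[[Any], Any]


def best_time(solver: Solver, problem: Any, repeats: int = 5) -> float:
    """Return the fastest of `repeats` wall-clock runs, in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        solver(problem)
        best = min(best, time.perf_counter() - start)
    return best


def speedup(
    reference: Solver,
    candidate: Solver,
    is_valid: Callable[[Any, Any], bool],
    problems: Sequence[Any],
    repeats: int = 5,
) -> Optional[float]:
    """Return total reference time / total candidate time.

    Returns None as soon as the candidate produces an invalid output,
    so that a fast but wrong solver earns no credit (an assumed policy).
    """
    ref_total = 0.0
    cand_total = 0.0
    for problem in problems:
        if not is_valid(problem, candidate(problem)):
            return None
        ref_total += best_time(reference, problem, repeats)
        cand_total += best_time(candidate, problem, repeats)
    # Guard against a zero denominator on very coarse timers.
    return ref_total / max(cand_total, 1e-12)


def insertion_sort(xs: Sequence[int]) -> list:
    """Toy 'reference' solver: a pure-Python insertion sort."""
    out = list(xs)
    for i in range(1, len(out)):
        key = out[i]
        j = i - 1
        while j >= 0 and out[j] > key:
            out[j + 1] = out[j]
            j -= 1
        out[j + 1] = key
    return out


def is_sorted_copy(problem: Sequence[int], output: Any) -> bool:
    """Validity check: the output must equal the sorted input."""
    return list(output) == sorted(problem)


# Hypothetical problem instances with a fixed seed for repeatability.
_rng = random.Random(0)
PROBLEMS = [[_rng.randrange(10_000) for _ in range(200)] for _ in range(3)]

if __name__ == "__main__":
    # Built-in `sorted` plays the role of the LM-written candidate.
    print(speedup(insertion_sort, sorted, is_sorted_copy, PROBLEMS))
```

Taking the minimum over several runs reduces the effect of timing noise, and checking validity before timing keeps incorrect solutions from being scored at all. Both are common choices for this kind of harness, not details confirmed by the text above.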

Code & Models

Datasets

oripress/AlgoTune
dataset · 14k downloads

Taxonomy

Topics: Model-Driven Software Engineering Techniques · Parallel Computing and Optimization Techniques