FormulaOne: Measuring the Depth of Algorithmic Reasoning Beyond Competitive Programming

Gal Beniamini; Yuval Dor; Alon Vinnikov; Shir Granot Peled; Or Weinstein; Or Sharir; Noam Wies; Tomer Nussbaum; Ido Ben Shaul; Tomer Zekharya; Yoav Levine; Shai Shalev-Shwartz; Amnon Shashua

arXiv:2507.13337 · cs.AI · July 18, 2025

TL;DR

FormulaOne is a challenging benchmark for AI models, focusing on complex graph theory and logic problems rooted in real-world research, revealing current models' limitations in expert-level reasoning.

Contribution

The paper introduces FormulaOne, a novel, challenging benchmark based on real research problems in graph theory and logic, with implications for AI reasoning and theoretical computer science.

Findings

01

State-of-the-art models solve fewer than 1% of the benchmark's questions

02

The benchmark covers problems related to the Strong Exponential Time Hypothesis (SETH) and to large-scale optimization

03

The dataset enables evaluation of advanced reasoning capabilities in AI

Abstract

Frontier AI models demonstrate formidable breadth of knowledge. But how close are they to true human -- or superhuman -- expertise? Genuine experts can tackle the hardest problems and push the boundaries of scientific understanding. To illuminate the limits of frontier model capabilities, we turn away from contrived competitive programming puzzles, and instead focus on real-life research problems. We construct FormulaOne, a benchmark that lies at the intersection of graph theory, logic, and algorithms, all well within the training distribution of frontier models. Our problems are incredibly demanding, requiring an array of reasoning steps. The dataset has three key properties. First, it is of commercial interest and relates to practical large-scale optimisation problems, such as those arising in routing, scheduling, and network design. Second, it is generated from the highly…

Taxonomy

Topics: Logic, Reasoning, and Knowledge