FormulaOne: Measuring the Depth of Algorithmic Reasoning Beyond Competitive Programming
Gal Beniamini, Yuval Dor, Alon Vinnikov, Shir Granot Peled, Or Weinstein, Or Sharir, Noam Wies, Tomer Nussbaum, Ido Ben Shaul, Tomer Zekharya, Yoav Levine, Shai Shalev-Shwartz, Amnon Shashua

TL;DR
FormulaOne is a challenging benchmark for AI models. It focuses on complex graph theory and logic problems rooted in real-world research, and it exposes the limits of current models' expert-level reasoning.
Contribution
The paper introduces FormulaOne, a novel, challenging benchmark based on real research problems in graph theory and logic, with implications for AI reasoning and theoretical computer science.
Findings
State-of-the-art models solve less than 1% of questions
The benchmark covers problems related to the Strong Exponential Time Hypothesis (SETH) and large-scale optimization
The dataset enables evaluation of advanced reasoning capabilities in AI
Abstract
Frontier AI models demonstrate formidable breadth of knowledge. But how close are they to true human -- or superhuman -- expertise? Genuine experts can tackle the hardest problems and push the boundaries of scientific understanding. To illuminate the limits of frontier model capabilities, we turn away from contrived competitive programming puzzles, and instead focus on real-life research problems. We construct FormulaOne, a benchmark that lies at the intersection of graph theory, logic, and algorithms, all well within the training distribution of frontier models. Our problems are incredibly demanding, requiring an array of reasoning steps. The dataset has three key properties. First, it is of commercial interest and relates to practical large-scale optimisation problems, such as those arising in routing, scheduling, and network design. Second, it is generated from the highly…
Taxonomy
Topics: Logic, Reasoning, and Knowledge
