SorryDB: Can AI Provers Complete Real-World Lean Theorems?
Austin Letson, Leopoldo Sarra, Auguste Poiroux, Oliver Dressler, Paul Lezeau, Dhyan Aranha, Frederick Pu, Aaron Hill, Miguel Corredera Hidalgo, Julian Berman, George Tsoukalas, Lenny Taelman

TL;DR
SorryDB is a continuously updated benchmark of real-world formalization tasks from GitHub, designed to evaluate and improve AI provers' ability to handle complex, practical mathematical proofs.
Contribution
The paper introduces SorryDB, a dynamic, real-world benchmark for formal mathematics, and evaluates various AI approaches on this evolving dataset.
Findings
Agentic approaches like Gemini Flash perform best but are not universally superior.
Current AI methods show complementary strengths in formal theorem proving.
SorryDB reduces test-set contamination and aligns evaluation with community needs.
Abstract
We present SorryDB, a dynamically-updating benchmark of open Lean tasks drawn from 78 real world formalization projects on GitHub. Unlike existing static benchmarks, often composed of competition problems, hillclimbing the SorryDB benchmark will yield tools that are aligned to the community needs, more usable by mathematicians, and more capable of understanding complex dependencies. Moreover, by providing a continuously updated stream of tasks, SorryDB mitigates test-set contamination and offers a robust metric for an agent's ability to contribute to novel formal mathematics projects. We evaluate a collection of approaches, including generalist large language models, agentic approaches, and specialized symbolic provers, over a selected snapshot of 1000 tasks from SorryDB. We show that current approaches are complementary: even though an agentic approach based on Gemini Flash is the most…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsScientific Computing and Data Management · Logic, programming, and type systems · Mobile Crowdsensing and Crowdsourcing
