FormalProofBench: Can Models Write Graduate Level Math Proofs That Are Formally Verified?

Nikil Ravi; Kexing Ying; Vasilii Nesterov; Rayan Krishnan; Elif Uskuplu; Bingyu Xia; Janitha Aswedige; Langston Nashold

arXiv:2603.26996·cs.AI·March 31, 2026

FormalProofBench: Can Models Write Graduate Level Math Proofs That Are Formally Verified?

Nikil Ravi, Kexing Ying, Vasilii Nesterov, Rayan Krishnan, Elif Uskuplu, Bingyu Xia, Janitha Aswedige, Langston Nashold

PDF

TL;DR

FormalProofBench is a benchmark to evaluate AI models' ability to produce formally verified graduate-level mathematical proofs in Lean 4, revealing limited accuracy and providing detailed analysis of model performance.

Contribution

Introduces a new benchmark for formal theorem proving at graduate level and evaluates frontier models' capabilities and limitations in this domain.

Findings

01

Best model achieves 33.5% accuracy in formal proofs.

02

Performance drops rapidly after top-performing models.

03

Provides empirical analysis of tool use, failure modes, and costs.

Abstract

We present FormalProofBench, a private benchmark designed to evaluate whether AI models can produce formally verified mathematical proofs at the graduate level. Each task pairs a natural-language problem with a Lean~4 formal statement, and a model must output a Lean proof accepted by the Lean 4 checker. FormalProofBench targets advanced undergraduate and graduate mathematics, with problems drawn from qualifying exams and standard textbooks across topics including analysis, algebra, probability, and logic. We evaluate a range of frontier models with an agentic harness, and find that the best-performing foundation model achieves 33.5% accuracy, with performance dropping rapidly after that. In addition to the accuracy numbers, we also provide empirical analysis of tool-use, failure modes, cost and latency, thereby providing a thorough evaluation of the formal-theorem proving abilities of…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.