MINIF2F-DAFNY: LLM-Guided Mathematical Theorem Proving via Auto-Active Verification
Mantas Baksys, Stefan Zetzsche, Olivier Bouissou, Sean B. Holden

TL;DR
This paper introduces MINIF2F-DAFNY, a benchmark translating mathematical problems to an auto-active verifier, demonstrating LLMs' effectiveness in guiding theorem proving with automation handling detailed proof steps.
Contribution
We present the first translation of miniF2F to an auto-active verifier, enabling evaluation of LLMs in guiding mathematical theorem proving within Dafny.
Findings
Dafny automates 39-44% of problems with empty proofs.
LLMs achieve 55.7% success rate in guiding proofs.
Effective division of labor: LLMs guide high-level reasoning, automation handles details.
Abstract
LLMs excel at reasoning, but validating their steps remains challenging. Formal verification offers a solution through mechanically checkable proofs. Interactive theorem provers (ITPs) dominate mathematical reasoning but require detailed low-level proof steps, while auto-active verifiers offer automation but focus on software verification. Recent work has begun bridging this divide by evaluating LLMs for software verification in ITPs, but the complementary direction--LLMs for mathematical theorem proving in auto-active verifiers--remains unexplored. We present MINIF2F-DAFNY, the first translation of the widely-used mathematical benchmark miniF2F to an auto-active verifier: Dafny. We find that Dafny's automation alone solves 39-44% of problems with empty proofs, whereas many require substantial proof guidance in ITPs. For remaining problems, we evaluate 7 off-the-shelf LLMs, achieving…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsLogic, programming, and type systems · Formal Methods in Verification · Polynomial and algebraic computation
