miniCodeProps: a Minimal Benchmark for Proving Code Properties
Evan Lohn, Sean Welleck

TL;DR
miniCodeProps is a new benchmark with 201 program specifications in Lean, designed to evaluate the ability of AI-based theorem provers to verify simple programs, revealing current limitations of neural provers on more complex proofs.
Contribution
The paper introduces miniCodeProps, a benchmark dataset for testing automated theorem proving on verified code, highlighting current neural provers' challenges with medium and hard properties.
Findings
State-of-the-art provers succeed on easy properties
Most medium and hard properties remain unproven by current methods
miniCodeProps is publicly available for future research
Abstract
AI agents have shown initial promise in automating mathematical theorem proving in proof assistants such as Lean. The same proof assistants can be used to verify the correctness of code by pairing code with specifications and proofs that the specifications hold. Automating the writing of code, specifications, and proofs could lower the cost of verification, or, ambitiously, enable an AI agent to output safe, provably correct code. However, it remains unclear whether current neural theorem provers can automatically verify even relatively simple programs. We present miniCodeProps, a benchmark of 201 program specifications in the Lean proof assistant, aimed at the subproblem of automatically generating a proof for a provided program and specification. miniCodeProps contains specifications about simple, self-contained programs (e.g., lists, natural numbers, binary trees) with varied proof…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsFormal Methods in Verification · Software Testing and Debugging Techniques · VLSI and Analog Circuit Testing
