Proving the Coding Interview: A Benchmark for Formally Verified Code Generation
Quinn Dougherty, Ronak Mehta

TL;DR
This paper introduces FVAPPS, a large benchmark dataset for formal verification of code, including program correctness proofs in Lean 4, to evaluate and advance automated theorem proving and program synthesis.
Contribution
It presents the FVAPPS benchmark with over 4700 samples for formal verification, expanding existing datasets to include theorem proving in Lean 4 and providing a new challenge for AI in code correctness.
Findings
Sonnet proves 30% of the theorems
Gemini proves 18% of the theorems
Benchmark is publicly available for research
Abstract
We introduce the Formally Verified Automated Programming Progress Standards, or FVAPPS, a benchmark of 4715 samples for writing programs and proving their correctness, the largest formal verification benchmark, including 1083 curated and quality controlled samples. Previously, APPS provided a benchmark and dataset for programming puzzles to be completed in Python and checked against unit tests, of the kind seen in technical assessments in the software engineering industry. Building upon recent approaches for benchmarks in interactive theorem proving, we generalize the unit tests to Lean 4 theorems given without proof (i.e., using Lean's "sorry" keyword). On the 406 theorems of 100 randomly selected samples, Sonnet correctly proves 30% and Gemini correctly proves 18%. We challenge the machine learning and program synthesis communities to solve both each general purpose programming…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsModel-Driven Software Engineering Techniques · Teaching and Learning Programming
