Proving the Coding Interview: A Benchmark for Formally Verified Code   Generation

Quinn Dougherty; Ronak Mehta

arXiv:2502.05714·cs.SE·February 11, 2025

Proving the Coding Interview: A Benchmark for Formally Verified Code Generation

Quinn Dougherty, Ronak Mehta

PDF

Open Access 1 Datasets

TL;DR

This paper introduces FVAPPS, a large benchmark dataset for formal verification of code, including program correctness proofs in Lean 4, to evaluate and advance automated theorem proving and program synthesis.

Contribution

It presents the FVAPPS benchmark with over 4700 samples for formal verification, expanding existing datasets to include theorem proving in Lean 4 and providing a new challenge for AI in code correctness.

Findings

01

Sonnet proves 30% of the theorems

02

Gemini proves 18% of the theorems

03

Benchmark is publicly available for research

Abstract

We introduce the Formally Verified Automated Programming Progress Standards, or FVAPPS, a benchmark of 4715 samples for writing programs and proving their correctness, the largest formal verification benchmark, including 1083 curated and quality controlled samples. Previously, APPS provided a benchmark and dataset for programming puzzles to be completed in Python and checked against unit tests, of the kind seen in technical assessments in the software engineering industry. Building upon recent approaches for benchmarks in interactive theorem proving, we generalize the unit tests to Lean 4 theorems given without proof (i.e., using Lean's "sorry" keyword). On the 406 theorems of 100 randomly selected samples, Sonnet correctly proves 30% and Gemini correctly proves 18%. We challenge the machine learning and program synthesis communities to solve both each general purpose programming…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Datasets

quinn-dougherty/fvapps
dataset· 8 dl
8 dl

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsModel-Driven Software Engineering Techniques · Teaching and Learning Programming