DafnyBench: A Benchmark for Formal Software Verification

Chloe Loughridge; Qinyi Sun; Seth Ahrenbach; Federico Cassano; Chuyue; Sun; Ying Sheng; Anish Mudide; Md Rakib Hossain Misu; Nada Amin; Max Tegmark

arXiv:2406.08467·cs.SE·June 13, 2024·3 cites

DafnyBench: A Benchmark for Formal Software Verification

Chloe Loughridge, Qinyi Sun, Seth Ahrenbach, Federico Cassano, Chuyue, Sun, Ying Sheng, Anish Mudide, Md Rakib Hossain Misu, Nada Amin, Max Tegmark

PDF

Open Access 1 Repo

TL;DR

DafnyBench is a large benchmark designed to evaluate machine learning models' ability to assist formal software verification, demonstrating that LLMs can verify over 750 programs with a 68% success rate using hints and feedback.

Contribution

This paper introduces DafnyBench, the largest benchmark for training and evaluating ML systems in formal software verification, and assesses LLMs' effectiveness in generating hints for verification.

Findings

01

LLMs can verify over 750 programs with a 68% success rate.

02

Feedback from error messages improves verification success.

03

Verification success decreases as code complexity and hint requirements increase.

Abstract

We introduce DafnyBench, the largest benchmark of its kind for training and evaluating machine learning systems for formal software verification. We test the ability of LLMs such as GPT-4 and Claude 3 to auto-generate enough hints for the Dafny formal verification engine to successfully verify over 750 programs with about 53,000 lines of code. The best model and prompting scheme achieved 68% success rate, and we quantify how this rate improves when retrying with error message feedback and how it deteriorates with the amount of required code and hints. We hope that DafnyBench will enable rapid improvements from this baseline as LLMs and verification techniques grow in quality.

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Repositories

sun-wendy/dafnybench
noneOfficial

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsSoftware Testing and Debugging Techniques · Formal Methods in Verification · Security and Verification in Computing

MethodsAttention Is All You Need · Residual Connection · Softmax · Layer Normalization · Byte Pair Encoding · Label Smoothing · Adam · Linear Layer · Multi-Head Attention · Position-Wise Feed-Forward Layer