VeriSoftBench: Repository-Scale Formal Verification Benchmarks for Lean

Yutong Xin; Qiaochu Chen; Greg Durrett; Işil Dillig

arXiv:2602.18307 · cs.SE · February 23, 2026

TL;DR

VeriSoftBench is a benchmark of 500 Lean 4 proof obligations drawn from software-verification developments. It shows that provers tuned for Mathlib-style mathematics transfer poorly to this setting and that repository context matters for proof success.

Contribution

Introduces VeriSoftBench, a repository-scale benchmark of Lean 4 proofs from software verification, and uses it to study how well existing proof automation transfers to this setting and how much dependency context matters.

Findings

1. Provers tuned for Mathlib-style mathematics transfer poorly to repository-centric proofs.

2. Proof success is strongly correlated with the size of a task's transitive dependency closure: tasks that draw on large, multi-hop closures fare worse.

3. Providing curated dependency context improves proof automation performance.

Abstract

Large language models have achieved striking results in interactive theorem proving, particularly in Lean. However, most benchmarks for LLM-based proof automation are drawn from mathematics in the Mathlib ecosystem, whereas proofs in software verification are developed inside definition-rich codebases with substantial project-specific libraries. We introduce VeriSoftBench, a benchmark of 500 Lean 4 proof obligations drawn from open-source formal-methods developments and packaged to preserve realistic repository context and cross-file dependencies. Our evaluation of frontier LLMs and specialized provers yields three observations. First, provers tuned for Mathlib-style mathematics transfer poorly to this repository-centric setting. Second, success is strongly correlated with transitive repository dependence: tasks whose proofs draw on large, multi-hop dependency closures are less likely…
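To make the notion of a multi-hop dependency closure concrete, the following is a minimal sketch of how one might compute the transitive closure of a proof obligation's dependencies. It is an illustration only: the toy graph, the declaration names, and the function `dependency_closure` are hypothetical and are not taken from the benchmark or its tooling.

```python
from collections import deque


def dependency_closure(target, deps):
    """Return every declaration reachable from `target` by following `deps`.

    `deps` maps a declaration name to the names it refers to directly.
    This mapping is a hypothetical stand-in for a real dependency graph.
    """
    seen = set()
    queue = deque(deps.get(target, ()))
    while queue:
        name = queue.popleft()
        if name in seen:
            continue  # already visited; also guards against cycles
        seen.add(name)
        queue.extend(deps.get(name, ()))
    return seen


# Hypothetical toy repository: a theorem and the definitions beneath it.
deps = {
    "compile_correct": ["eval_step", "Instr"],
    "eval_step": ["State", "Instr"],
    "State": ["Reg"],
}

direct = set(deps["compile_correct"])
closure = dependency_closure("compile_correct", deps)

# Direct dependencies are one hop away; the closure includes every hop.
print(len(direct), len(closure))
```

In this toy graph the theorem names two declarations directly, but its closure contains four, because `State` and `Reg` are reachable only through `eval_step`. A prover that sees only the direct dependencies would lack those definitions.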

Code & Models

Datasets

maxRyeery/VeriSoftBench
dataset · 18 downloads

Taxonomy

Topics: Logic, programming, and type systems · Mathematics, Computing, and Information Processing · Model-Driven Software Engineering Techniques