Neural Theorem Proving for Verification Conditions: A Real-World Benchmark
Qiyuan Xu, Xiaokun Luan, Renxi Wang, Joshua Ong Jun Leang, Peixin Wang, Haonan Li, Wenda Li, Conrad Watt

TL;DR
This paper introduces a new benchmark for neural theorem proving applied to verification conditions in real-world software, evaluating large language models and highlighting current challenges in automated program verification.
Contribution
It presents the first real-world multi-language benchmark for neural theorem proving in program verification, derived from industrial projects and pipelines.
Findings
LLMs show potential in VC proving
Significant challenges remain for automated verification
Large gap between current capabilities and practical needs
Abstract
Theorem proving is fundamental to program verification, where the automated proof of Verification Conditions (VCs) remains a primary bottleneck. Real-world program verification frequently encounters hard VCs that existing Automated Theorem Provers (ATPs) cannot prove, leading to a critical need for extensive manual proofs that burden practical application. While Neural Theorem Proving (NTP) has achieved significant success in mathematical competitions, demonstrating the potential of machine learning approaches to formal reasoning, its application to program verification--particularly VC proving--remains largely unexplored. Despite existing work on annotation synthesis and verification-related theorem proving, no benchmark has specifically targeted this fundamental bottleneck: automated VC proving. This work introduces Neural Theorem Proving for Verification Conditions (NTP4VC),…
Peer Reviews
Decision·ICLR 2026 Poster
- Constructing a real-world program verification benchmark using VCs is well-motivated and the paper is well-written. - The benchmark construction is based on high-quality expert-written rules, which is reliable and one-time effort. - The NTP4VC benchmark strikes a good balance of two categories of challenging program verifications, specifically, small programming puzzles and real-world projects. Furthermore, the transient nature of VCs make potential contamination less of a concern. - Experim
- compared to programs and annotations at the source code level, VCs and relevant proofs are perhaps too low-level and less suitable for LLMs to generate proofs, since human experts rarely write proofs for VCs. Although NTP4VCs is an interesting and challenging benchmark, but may not be necessary to tackle directly by LLMs. The cost is also prohibitive, as indicated by the limited evaluation with GPT-o4-mini-high. - three themes of errors (presented in sec 5.2) are well expected; having a quant
1. **Benchmark contribution:** The authors introduce a new dataset specifically addressing VC proving, which is a practical bottleneck in automated program verification. 2. **Industrial relevance:** VCs are extracted from real-world industrial projects using established verification pipelines (Why3, Frama-C), increasing practical relevance. 3. **Multi-language and scenario coverage:** The dataset spans multiple proof assistant languages and aims to cover a variety of verification scenari
1. **Limited dataset novelty:** Although the focus on VCs is clear, the novelty is moderate. Previous datasets (e.g., CoqStop, FVEL, CoqGym) have also included verification-related theorems, even if the VC proportion was lower. The paper should better highlight what fundamentally differentiates NTP4VC—beyond just the percentage of VCs. 2. **Lack of technical contribution:** The work mainly involves dataset construction and benchmarking; there is no new technical advance. 3. **Provability n
1. The work is novel to the best of my knowledge, and addresses a fundamental gap: automating program verification for real-world complex programs. 2. Thorough evaluation of models and NTPs on the generated datasets. I also appreciate the datasets being created for different languages.
1. Lack of details on how the 5.3K VCs get filtered down to 627: I'm not sure why this was done, or what were the criteria for eliminating majority of the VCs. Were they too similar to each other, making them redundant? Or was it some deeper issue? Regardless, the elimination process was not described in the paper. Furthermore, out of the 5.3K, how many were extracted from the real-world programs and how many were extracted from the pearls of programs? 2. The pearls of programs: Does the inclusi
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsFormal Methods in Verification · Logic, programming, and type systems · Adversarial Robustness in Machine Learning
