VERINA: Benchmarking Verifiable Code Generation
Zhe Ye, Zhengxu Yan, Jingxuan He, Timothe Kasriel, Kaiyu Yang, Dawn Song

TL;DR
VERINA is a comprehensive benchmark for evaluating verifiable code generation by large language models, highlighting current challenges and guiding future improvements in code correctness, specification accuracy, and proof generation.
Contribution
It introduces VERINA, a detailed, modular benchmark with 189 tasks for holistic evaluation of code, specifications, and proofs in verifiable code generation.
Findings
OpenAI o3 achieves 72.6% code correctness
52.3% for specification soundness and completeness
4.9% proof success rate
Abstract
Large language models (LLMs) are increasingly integrated in software development, but ensuring correctness in LLM-generated code remains challenging and often requires costly manual review. Verifiable code generation -- jointly generating code, specifications, and proofs of code-specification alignment -- offers a promising path to address this limitation and further unleash LLMs' benefits in coding. Yet, there exists a significant gap in evaluation: current benchmarks often focus on only individual components rather than providing a holistic evaluation framework of all tasks. In this paper, we introduce VERINA (Verifiable Code Generation Arena), a high-quality benchmark enabling a comprehensive and modular evaluation of code, specification, and proof generation as well as their compositions. VERINA consists of 189 manually curated coding tasks in Lean, with detailed problem…
Peer Reviews
Decision·ICLR 2026 Poster
1. The paper makes a well-scoped and valuable contribution by introducing a unified benchmark for code, specification, and proof generation. 2. The proposed hybrid evaluation metric for SpecGen is a practical approach to a difficult problem, leveraging both formal proving and extensive testing to assess specification quality. 3. The paper provides a thorough experimental evaluation across a diverse set of models, including general-purpose LLMs, specialized provers, and an agentic baseline, offer
1. Gaps in Evaluation Metrics: The SpecGen evaluation logic is potentially not perfect. It uses negative test cases to evaluate both the soundness and completeness of different specification components, but does not appear to distinguish between tests that violate the pre-condition versus those that violate the post-condition. This can lead to noisy or incorrect soundness/completeness judgments. 2. The paper's claims of test suite adequacy rely on coverage metrics from Python reference implement
1). VERINA covers CodeGen, SpecGen, and ProofGen and allows their compositions, which matches realistic verification workflows rather than isolated tasks. 2). The formal treatment of soundness and completeness for pre- and post-conditions is crisp. The two-stage evaluator (prover first, then property-based testing) gives a practical way to score imperfectly decidable relationships and communicates uncertainty via “R might hold” and error bars.
1). Modest scale and representativeness are not convincing. With 189 single-file tasks, the benchmark remains small for broad performance claims. The focus on stand-alone snippets under-represents the multi-module, dependency-heavy settings that matter for real systems. The paper acknowledges this limitation but it still constrains the practical impact. 2). The “100% coverage” is established on Python references rather than the Lean artifacts under evaluation, due to missing Lean tooling. This
- Clear Presentation: The presentation is well-executed. - Extensive Task Support: VERINA comprehensively supports all three core tasks: CodeGen, SpecGen, and ProofGen, utilizing a modular and compositional approach. - Thorough Experimental Evaluation: The evaluation is extensive, involving 8 general-purpose and 3 specialized LLMs. It includes detailed ablations on the impact of contextual information, iterative refinement, and a breakdown of difficulty levels.
- Limited Language/Tool Coverage: The reliance on Lean significantly restricts the generalizability of the findings. The paper fails to justify how Lean-specific results would translate to other widely used verification ecosystems like Dafny, Verus, or Frama-C, which often feature different proof automation capabilities and specification styles. - Unclear Justification for Lean in Code Verification: Lean is primarily known for mathematical verification rather than production code verification. T
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsSoftware Engineering Research · Model-Driven Software Engineering Techniques · Machine Learning in Materials Science
