AlgoVeri: An Aligned Benchmark for Verified Code Generation on Classical Algorithms
Haoyu Zhao, Ziran Yang, Jiawei Li, Deyuan He, Zenan Li, Chi Jin, Venugopal V. Veeravalli, Aarti Gupta, Sanjeev Arora

TL;DR
AlgoVeri introduces a comprehensive benchmark for evaluating AI-generated verified code across multiple classical algorithms and verification systems, revealing significant performance gaps and insights into the impact of language design on verification success.
Contribution
This work provides the first unified benchmark for cross-paradigm vericoding evaluation, enabling direct comparison of models across different verification languages and highlighting key capability gaps.
Findings
Frontier models perform best in Dafny with 40.3% success rate.
Performance drops significantly in Verus (24.7%) and Lean (7.8%).
Model behavior varies with language design, affecting refinement and error correction.
Abstract
Vericoding refers to the generation of formally verified code from rigorous specifications. Recent AI models show promise in vericoding, but a unified methodology for cross-paradigm evaluation is lacking. Existing benchmarks test only individual languages/tools (e.g., Dafny, Verus, and Lean) and each covers very different tasks, so the performance numbers are not directly comparable. We address this gap with AlgoVeri, a benchmark that evaluates vericoding of classical algorithms in Dafny, Verus, and Lean. By enforcing identical functional contracts, AlgoVeri reveals critical capability gaps in verification systems. While frontier models achieve tractable success in Dafny (% for Gemini-3 Flash), where high-level abstractions and SMT automation simplify the workflow, performance collapses under the systems-level memory constraints of Verus (%) and the explicit proof…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsSoftware Testing and Debugging Techniques · Formal Methods in Verification · Logic, programming, and type systems
