AlgoVeri: An Aligned Benchmark for Verified Code Generation on Classical Algorithms

Haoyu Zhao; Ziran Yang; Jiawei Li; Deyuan He; Zenan Li; Chi Jin; Venugopal V. Veeravalli; Aarti Gupta; Sanjeev Arora

arXiv:2602.09464 · cs.SE · February 11, 2026

TL;DR

AlgoVeri is a benchmark for evaluating AI-generated verified code on 77 classical algorithms in three verification systems (Dafny, Verus, and Lean) under identical functional contracts. It reveals large performance gaps between the systems and shows how language design affects verification success.

Contribution

This work provides the first unified benchmark for cross-paradigm vericoding evaluation, enabling direct comparison of models across different verification languages and highlighting key capability gaps.

Findings

01 Frontier models perform best in Dafny: Gemini-3 Flash reaches a 40.3% success rate.

02 Success rates drop sharply in Verus (24.7%) and Lean (7.8%).

03 Model behavior varies with language design, which affects refinement and error correction.
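
To put the success rates above in terms of task counts, the following minimal sketch converts each percentage into the number of verified tasks it would imply. It assumes that every rate is a fraction of the 77 benchmark tasks; the page does not state the denominators, so the counts are illustrative rather than reported figures. The names `TOTAL_TASKS`, `reported_rates`, and `implied_solved` are hypothetical.

```python
# Assumption: each reported rate is (tasks verified) / 77. The page does not
# state the denominators, so these counts are inferred, not reported.
TOTAL_TASKS = 77
reported_rates = {"Dafny": 40.3, "Verus": 24.7, "Lean": 7.8}


def implied_solved(rate_percent, total=TOTAL_TASKS):
    """Return the whole number of tasks that best matches a percentage."""
    return round(rate_percent / 100 * total)


implied = {lang: implied_solved(rate) for lang, rate in reported_rates.items()}

for lang, solved in implied.items():
    # Recompute the percentage from the inferred count as a consistency check.
    recomputed = round(100 * solved / TOTAL_TASKS, 1)
    print(f"{lang}: about {solved}/{TOTAL_TASKS} tasks ({recomputed}%)")
```

Under that assumption, each inferred count rounds back to the reported percentage, so the three figures are consistent with a shared denominator of 77.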

Abstract

Vericoding refers to the generation of formally verified code from rigorous specifications. Recent AI models show promise in vericoding, but a unified methodology for cross-paradigm evaluation is lacking. Existing benchmarks test only individual languages/tools (e.g., Dafny, Verus, and Lean) and each covers very different tasks, so the performance numbers are not directly comparable. We address this gap with AlgoVeri, a benchmark that evaluates vericoding of 77 classical algorithms in Dafny, Verus, and Lean. By enforcing identical functional contracts, AlgoVeri reveals critical capability gaps in verification systems. While frontier models achieve tractable success in Dafny (40.3% for Gemini-3 Flash), where high-level abstractions and SMT automation simplify the workflow, performance collapses under the systems-level memory constraints of Verus (24.7%) and the explicit proof…

Code & Models

Datasets

lizn-zn/algoveri-lean
dataset · 116 downloads

Taxonomy

Topics: Software Testing and Debugging Techniques · Formal Methods in Verification · Logic, programming, and type systems