CLEVER: A Curated Benchmark for Formally Verified Code Generation

Amitayush Thakur; Jasper Lee; George Tsoukalas; Meghana Sistla; Matthew Zhao; Stefan Zetzsche; Greg Durrett; Yisong Yue; Swarat Chaudhuri

arXiv:2505.13938·cs.LG·October 24, 2025

CLEVER: A Curated Benchmark for Formally Verified Code Generation

Amitayush Thakur, Jasper Lee, George Tsoukalas, Meghana Sistla, Matthew Zhao, Stefan Zetzsche, Greg Durrett, Yisong Yue, Swarat Chaudhuri

PDF

Open Access 2 Repos 1 Datasets 1 Video

TL;DR

CLEVER is a curated benchmark of 161 verified code generation problems in Lean, designed to evaluate the ability of language models to produce correct, verifiable code without relying on test cases or leaked specifications.

Contribution

It introduces a high-quality, curated benchmark for verified code generation that avoids common pitfalls and provides post-hoc verification, setting a new standard for evaluating formal code synthesis.

Findings

01

State-of-the-art models struggle to fully verify generated code.

02

Benchmark reveals the difficulty of achieving complete formal verification.

03

Provides a challenging testbed for future research in verified program synthesis.

Abstract

We introduce $C LEVER$ , a high-quality, curated benchmark of 161 problems for end-to-end verified code generation in Lean. Each problem consists of (1) the task of generating a specification that matches a held-out ground-truth specification, and (2) the task of generating a Lean implementation that provably satisfies this specification. Unlike prior benchmarks, $C LEVER$ avoids test-case supervision, LLM-generated annotations, and specifications that leak implementation logic or allow vacuous solutions. All outputs are verified post-hoc using Lean's type checker to ensure machine-checkable correctness. We use $C LEVER$ to evaluate several few-shot and agentic approaches based on state-of-the-art language models. These methods all struggle to achieve full verification, establishing it as a challenging frontier benchmark for program synthesis and…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Repositories

Datasets

amitayusht/clever
dataset· 45 dl
45 dl

Videos

CLEVER: A Curated Benchmark for Formally Verified Code Generation· slideslive

Taxonomy

TopicsSoftware Testing and Debugging Techniques · Logic, programming, and type systems · Software Engineering Research