CodeSpecBench: Benchmarking LLMs for Executable Behavioral Specification Generation

Zaoyu Chen; Jianbo Dai; Boyu Zhu; Jingdong Wang; Huiming Wang; Xin Xu; Haoyang Yuan; Zhijiang Guo; Xiao-Ming Wu

arXiv:2604.12268·cs.SE·April 15, 2026

TL;DR

CodeSpecBench is a new benchmark for evaluating how well large language models generate executable behavioral specifications, revealing significant challenges in capturing program semantics beyond code syntax.

Contribution

The paper introduces a benchmark with execution-based evaluation for specification generation, and highlights the gap between code generation ability and semantic understanding in LLMs.

Findings

01 The best model achieves only a 20.2% pass rate on repository-level tasks.

02 Specification generation is more challenging than code generation.

03 Strong coding performance does not imply deep semantic understanding.

Abstract

Large language models (LLMs) can generate code from natural language, but the extent to which they capture intended program behavior remains unclear. Executable behavioral specifications, defined via preconditions and postconditions, provide a concrete means to assess such understanding. However, existing work on specification generation is constrained in evaluation methodology, task settings, and specification expressiveness. We introduce CodeSpecBench, a benchmark for executable behavioral specification generation under an execution-based evaluation protocol. CodeSpecBench supports both function-level and repository-level tasks and encodes specifications as executable Python functions. Constructed from diverse real-world codebases, it enables a realistic assessment of both correctness (accepting valid behaviors) and completeness (rejecting invalid behaviors). Evaluating 15…
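To make the abstract's terms concrete, here is a minimal sketch of a specification written as executable Python functions, checked for correctness (accepting valid behaviors) and completeness (rejecting invalid behaviors). Everything in it is an illustrative assumption rather than the benchmark's actual harness: the target function `unique_sorted`, the names `precondition` and `postcondition`, the test cases, and the way the two scores are computed are all hypothetical, and CodeSpecBench's own pass-rate definition may differ.

```python
from typing import List, Tuple

Case = Tuple[List[int], List[int]]  # (input, observed output)


def unique_sorted(xs: List[int]) -> List[int]:
    """Hypothetical function under specification (not taken from the benchmark)."""
    return sorted(set(xs))


def precondition(xs) -> bool:
    """Input must be a list of ints (bools excluded)."""
    return isinstance(xs, list) and all(
        isinstance(x, int) and not isinstance(x, bool) for x in xs
    )


def postcondition(xs: List[int], result: List[int]) -> bool:
    """Output must be strictly increasing and contain exactly the input's elements."""
    strictly_increasing = all(a < b for a, b in zip(result, result[1:]))
    return strictly_increasing and set(result) == set(xs)


def spec_accepts(xs: List[int], result: List[int]) -> bool:
    """A behavior is accepted when both conditions hold."""
    return precondition(xs) and postcondition(xs, result)


def correctness(valid_cases: List[Case]) -> float:
    """Fraction of valid behaviors the specification accepts."""
    return sum(spec_accepts(xs, out) for xs, out in valid_cases) / len(valid_cases)


def completeness(invalid_cases: List[Case]) -> float:
    """Fraction of invalid behaviors the specification rejects."""
    return sum(not spec_accepts(xs, out) for xs, out in invalid_cases) / len(invalid_cases)


# Valid behaviors: outputs produced by the reference implementation.
valid_cases = [(xs, unique_sorted(xs)) for xs in ([3, 1, 3], [], [2, 2, 2], [5, 4])]

# Invalid behaviors: hand-written wrong outputs (assumed for illustration).
invalid_cases = [
    ([3, 1, 3], [3, 1]),     # not sorted
    ([3, 1, 3], [1, 3, 3]),  # duplicate kept
    ([5, 4], [4]),           # element dropped
    ([], [0]),               # element invented
]

if __name__ == "__main__":
    print("correctness:", correctness(valid_cases))
    print("completeness:", completeness(invalid_cases))
```

A weaker postcondition that only checked ordering would still accept every valid case but would fail to reject the dropped-element and invented-element cases, which is why the two properties are measured separately.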

Code & Models

Repositories

SparksofAGI/CodeSpecBench (GitHub)
