TL;DR
CodeSpecBench is a new benchmark for evaluating how well large language models generate executable behavioral specifications, revealing significant challenges in capturing program semantics beyond code syntax.
Contribution
It introduces a comprehensive benchmark with execution-based evaluation for specification generation, highlighting the gap between code generation and semantic understanding in LLMs.
Findings
The best model achieves only a 20.2% pass rate on repository-level tasks.
Specification generation is more challenging than code generation.
Strong coding performance does not imply deep semantic understanding.
Abstract
Large language models (LLMs) can generate code from natural language, but the extent to which they capture intended program behavior remains unclear. Executable behavioral specifications, defined via preconditions and postconditions, provide a concrete means to assess such understanding. However, existing work on specification generation is constrained in evaluation methodology, task settings, and specification expressiveness. We introduce CodeSpecBench, a benchmark for executable behavioral specification generation under an execution-based evaluation protocol. CodeSpecBench supports both function-level and repository-level tasks and encodes specifications as executable Python functions. Constructed from diverse real-world codebases, it enables a realistic assessment of both correctness (accepting valid behaviors) and completeness (rejecting invalid behaviors). Evaluating 15…
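To make the abstract's terms concrete, here is a minimal sketch of what a specification "encoded as executable Python functions" might look like, and how correctness and completeness could be checked by execution. Everything in it is an illustrative assumption rather than the benchmark's actual format or protocol: the target function (sorting a list of integers), the names `precondition`, `postcondition`, `weak_postcondition`, and `evaluate_spec`, and the hand-written test cases are all hypothetical.

```python
from collections import Counter
from typing import Callable

# Hypothetical target behavior: a function that sorts a list of ints ascending.

def precondition(xs: list) -> bool:
    """Assumed precondition: the input is a list of integers."""
    return isinstance(xs, list) and all(type(x) is int for x in xs)

def postcondition(xs: list, result: list) -> bool:
    """Assumed postcondition: result is ascending and a permutation of xs."""
    is_sorted = all(a <= b for a, b in zip(result, result[1:]))
    return is_sorted and Counter(xs) == Counter(result)

def weak_postcondition(xs: list, result: list) -> bool:
    """A deliberately under-specified postcondition: checks ordering only."""
    return all(a <= b for a, b in zip(result, result[1:]))

def evaluate_spec(pre: Callable, post: Callable, valid_cases, invalid_cases) -> dict:
    """Run a spec against (input, output) pairs.

    correct:  every valid behavior is accepted.
    complete: every invalid behavior is rejected.
    """
    def accepts(xs, out) -> bool:
        return pre(xs) and post(xs, out)

    return {
        "correct": all(accepts(xs, out) for xs, out in valid_cases),
        "complete": all(not accepts(xs, out) for xs, out in invalid_cases),
    }

# Hand-written illustrative cases (not taken from the benchmark).
valid_cases = [([3, 1, 2], [1, 2, 3]), ([], [])]
invalid_cases = [
    ([3, 1, 2], [3, 1, 2]),  # output left unsorted
    ([3, 1, 2], [1, 2]),     # output is sorted but drops an element
]

strong = evaluate_spec(precondition, postcondition, valid_cases, invalid_cases)
weak = evaluate_spec(precondition, weak_postcondition, valid_cases, invalid_cases)
print(strong)
print(weak)
```

In this sketch the ordering-only postcondition accepts every valid behavior but also accepts the output that drops an element, so it is correct without being complete. That is the kind of gap an execution-based check can expose and a purely syntactic comparison of specifications cannot.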
