TL;DR
This paper systematically evaluates LLM- and agent-based methods for formal specification generation in C programs, revealing their limitations and the impact of different prompting strategies.
Contribution
It introduces LiveFMBench, a new benchmark for evaluating specification generation, and provides a comprehensive analysis of LLM and agent performance and failure modes.
Findings
Naive evaluation overestimates model performance due to unfaithful behaviors.
Thinking mode and increased sampling improve success rates, especially for smaller models.
Agentic pipelines excel under low sampling budgets and on challenging datasets.
Abstract
Formal specification is essential for rigorous program verification, yet writing correct specifications remains costly and difficult to automate. Although large language models (LLMs) and agents have shown promising progress, their true capabilities and failure modes remain unclear. We present the first systematic and contamination-aware study of LLM- and agent-based formal specification generation for C programs. We introduce LiveFMBench, a continuously evolving benchmark of 630 ACSL (ANSI/ISO C Specification Language)-annotated C programs, including 360 newly collected cases designed to mitigate data leakage. Using this benchmark, we evaluate direct prompting with different sampling sizes, reasoning-enabled (thinking mode) inference, the agentic pipeline, and perform a fine-grained failure analysis. Experimental results reveal that naive evaluation substantially overestimates…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
