LiveFMBench: Unveiling the Power and Limits of Agentic Workflows in Specification Generation

Dong Xu; Jialun Cao; Guozhao Mo; Junjie Hu; Cheng Wen; Hongyu Lin; Xianpei Han; Shengchao Qin; Cong Tian; Shing-Chi Cheung; Le Sun; Yaojie Lu

arXiv:2605.01394·cs.SE·May 5, 2026

LiveFMBench: Unveiling the Power and Limits of Agentic Workflows in Specification Generation

Dong Xu, Jialun Cao, Guozhao Mo, Junjie Hu, Cheng Wen, Hongyu Lin, Xianpei Han, Shengchao Qin, Cong Tian, Shing-Chi Cheung, Le Sun, Yaojie Lu

PDF

1 Repo

TL;DR

This paper systematically evaluates LLM- and agent-based methods for formal specification generation in C programs, revealing their limitations and the impact of different prompting strategies.

Contribution

It introduces LiveFMBench, a new benchmark for evaluating specification generation, and provides a comprehensive analysis of LLM and agent performance and failure modes.

Findings

01

Naive evaluation overestimates model performance due to unfaithful behaviors.

02

Thinking mode and increased sampling improve success rates, especially for smaller models.

03

Agentic pipelines excel under low sampling budgets and on challenging datasets.

Abstract

Formal specification is essential for rigorous program verification, yet writing correct specifications remains costly and difficult to automate. Although large language models (LLMs) and agents have shown promising progress, their true capabilities and failure modes remain unclear. We present the first systematic and contamination-aware study of LLM- and agent-based formal specification generation for C programs. We introduce LiveFMBench, a continuously evolving benchmark of 630 ACSL (ANSI/ISO C Specification Language)-annotated C programs, including 360 newly collected cases designed to mitigate data leakage. Using this benchmark, we evaluate direct prompting with different sampling sizes, reasoning-enabled (thinking mode) inference, the agentic pipeline, and perform a fine-grained failure analysis. Experimental results reveal that naive evaluation substantially overestimates…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Repositories

https://huggingface.co/datasets/fm-universe/Live-FM-Bench
github

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.