Evaluating LLM-Generated ACSL Annotations for Formal Verification
Arshad Beg, Diarmuid O'Donoghue, Rosemary Monahan

TL;DR
This paper empirically evaluates automated ACSL annotation methods for C programs, comparing rule-based, plugin, and large language model approaches in the context of formal verification.
Contribution
It provides a comparative analysis of LLM-based and rule-based annotation strategies for C program verification, highlighting current limitations and potential of LLMs.
Findings
Rule-based methods are more reliable for verification success.
LLM-based approaches show variable performance and limitations.
The study uses the CASP benchmark and Frama-C's WP plugin for evaluation.
Abstract
Formal specifications are crucial for building verifiable and dependable software systems, yet generating accurate and verifiable specifications for real-world C programs remains challenging. This paper presents an empirical evaluation of automated ACSL annotation generation strategies for C programs, comparing a rule-based Python script, Frama-C's RTE plugin, and three large language models (DeepSeek-V3.2, GPT-5.2, and OLMo 3.1 32B Instruct). The study focuses on one-shot annotation generation, assessing how these approaches perform when directly applied to verification tasks. Using a filtered subset of the CASP benchmark, we evaluate generated annotations through Frama-C's WP plugin with multiple SMT solvers, analyzing proof success rates, solver timeouts, and internal processing time. Our results show that rule-based approaches remain more reliable for verification success, while…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
