Talk is Cheap, Logic is Hard: Benchmarking LLMs on Post-Condition Formalization
I.S.W.B. Prasetya, Fitsum Kifetew, Davide Prandi

TL;DR
This study evaluates how well large language models can generate formal pre- and post-conditions from natural language descriptions, revealing strengths, weaknesses, and the importance of test validation.
Contribution
It provides a comprehensive benchmark of 24 LLMs on formal specification tasks and analyzes the impact of test-based validation on their accuracy.
Findings
LLMs can produce valid pre- and post-conditions but with notable errors.
Proprietary models tend to have fewer errors but higher false negatives.
Automated tests improve detection of incorrect solutions.
Abstract
Formal specifications, such as pre- and post-conditions provide a solid basis for performing thorough program verification. However, developers rarely provide such formal specifications, hence if AI could help in constructing them, it would make formal verification possible or at least make automated testing much more effective. This paper presents a study on the ability of LLMs in generating formal FULL pre- and post-conditions of a program, given its natural language description. 24 state-of-the-art LLMs were evaluated on a freshly prepared dataset of 40 tasks. The paper investigates specifications of varying difficulty and discusses a set of more refined performance metrics in addition the general accept@1 performance. It also investigates the impact of using automatically generated tests for validation of the solutions proposed by LLMs. The results of the experiment reveal that, in…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsSoftware Testing and Debugging Techniques · Formal Methods in Verification · Software Engineering Research
