Talk is Cheap, Logic is Hard: Benchmarking LLMs on Post-Condition Formalization

I.S.W.B. Prasetya; Fitsum Kifetew; Davide Prandi

arXiv:2603.17193·cs.SE·March 19, 2026

Talk is Cheap, Logic is Hard: Benchmarking LLMs on Post-Condition Formalization

I.S.W.B. Prasetya, Fitsum Kifetew, Davide Prandi

PDF

Open Access

TL;DR

This study evaluates how well large language models can generate formal pre- and post-conditions from natural language descriptions, revealing strengths, weaknesses, and the importance of test validation.

Contribution

It provides a comprehensive benchmark of 24 LLMs on formal specification tasks and analyzes the impact of test-based validation on their accuracy.

Findings

01

LLMs can produce valid pre- and post-conditions but with notable errors.

02

Proprietary models tend to have fewer errors but higher false negatives.

03

Automated tests improve detection of incorrect solutions.

Abstract

Formal specifications, such as pre- and post-conditions provide a solid basis for performing thorough program verification. However, developers rarely provide such formal specifications, hence if AI could help in constructing them, it would make formal verification possible or at least make automated testing much more effective. This paper presents a study on the ability of LLMs in generating formal FULL pre- and post-conditions of a program, given its natural language description. 24 state-of-the-art LLMs were evaluated on a freshly prepared dataset of 40 tasks. The paper investigates specifications of varying difficulty and discusses a set of more refined performance metrics in addition the general accept@1 performance. It also investigates the impact of using automatically generated tests for validation of the solutions proposed by LLMs. The results of the experiment reveal that, in…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsSoftware Testing and Debugging Techniques · Formal Methods in Verification · Software Engineering Research