POSTCONDBENCH: Benchmarking Correctness and Completeness in Formal Postcondition Inference

Gehao Zhang; Juan Zhai

arXiv:2605.03356·cs.SE·May 6, 2026

POSTCONDBENCH: Benchmarking Correctness and Completeness in Formal Postcondition Inference

Gehao Zhang, Juan Zhai

PDF

TL;DR

POSTCONDBENCH is a comprehensive multilingual benchmark for evaluating the accuracy and completeness of automatically generated postconditions in real-world software, addressing limitations of prior benchmarks.

Contribution

It introduces a large, real-world dataset with expert-verified postconditions, enabling automatic evaluation of postcondition generation methods across multiple settings.

Findings

01

Significant gap between correctness and completeness in current models

02

Repository dependencies and method complexity worsen generation quality

03

Evaluation on 5 SOTA LLMs highlights challenges in postcondition inference

Abstract

Formal postconditions precisely characterize program behavior and support debugging, testing, and verification, but writing them requires substantial expertise and effort. This has motivated recent work on automatically generating postconditions from code and natural-language artifacts using large language models (LLMs). However, evaluation remains a key bottleneck. Existing benchmarks primarily emphasize correctness under limited evaluation settings, often relying on surface-form matching or manual assessment on small or synthetic datasets. We introduce POSTCONDBENCH, a multilingual benchmark for evaluating method-level postcondition generation from real-world software. POSTCONDBENCH comprises 420 Python and Java tasks drawn from 121 open-source projects, each paired with a high-quality ground-truth postcondition set constructed with expert involvement. To enable automatic…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.