POSTCONDBENCH: Benchmarking Correctness and Completeness in Formal Postcondition Inference
Gehao Zhang, Juan Zhai

TL;DR
POSTCONDBENCH is a comprehensive multilingual benchmark for evaluating the accuracy and completeness of automatically generated postconditions in real-world software, addressing limitations of prior benchmarks.
Contribution
It introduces a large, real-world dataset with expert-verified postconditions, enabling automatic evaluation of postcondition generation methods across multiple settings.
Findings
Significant gap between correctness and completeness in current models
Repository dependencies and method complexity worsen generation quality
Evaluation on 5 SOTA LLMs highlights challenges in postcondition inference
Abstract
Formal postconditions precisely characterize program behavior and support debugging, testing, and verification, but writing them requires substantial expertise and effort. This has motivated recent work on automatically generating postconditions from code and natural-language artifacts using large language models (LLMs). However, evaluation remains a key bottleneck. Existing benchmarks primarily emphasize correctness under limited evaluation settings, often relying on surface-form matching or manual assessment on small or synthetic datasets. We introduce POSTCONDBENCH, a multilingual benchmark for evaluating method-level postcondition generation from real-world software. POSTCONDBENCH comprises 420 Python and Java tasks drawn from 121 open-source projects, each paired with a high-quality ground-truth postcondition set constructed with expert involvement. To enable automatic…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
