Can Large Language Models Model Programs Formally?
Zhiyong Chen, Jialun Cao, Jiarong Wu, Chang Xu, Shing-Chi Cheung

TL;DR
This paper introduces Model-Bench, a benchmark for evaluating large language models' ability to convert Python programs into model checking specifications, highlighting current limitations and future directions.
Contribution
It presents a new benchmark and pipeline for assessing LLMs' program modeling capabilities for formal verification tasks.
Findings
LLMs show significant limitations in modeling programs for formal verification.
Model-Bench includes 400 Python programs from established benchmarks.
Experiments reveal gaps and inspire directions for improving LLMs in formal modeling.
Abstract
In the digital age, ensuring the correctness, safety, and reliability of software through formal verification is paramount, particularly as software increasingly underpins critical infrastructure. Formal verification, split into theorem proving and model checking, provides a feasible and reliable path. Unlike theorem proving, which yields notable advances, model checking has been less focused due to the difficulty of automatic program modeling. To fill this gap, we introduce Model-Bench, a benchmark and an accompanying pipeline for evaluating and improving LLMs' program modeling capability by modeling Python programs into verification-ready model checking specifications checkable by its accompanying model checker. Model-Bench comprises 400 Python programs derived from three well-known benchmarks (HumanEval, MBPP, and LiveCodeBench). Our extensive experiments reveal significant…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
