From Natural Language to Verified Code: Toward AI Assisted Problem-to-Code Generation with Dafny-Based Formal Verification
Md Erfan, Md Kamal Hossain Chowdhury, Ahmed Ryan, and Md Rayhanur Rahman

TL;DR
This paper demonstrates that open-weight large language models can effectively generate verified code with formal specifications using a tiered prompting strategy and iterative feedback, achieving high verification success rates.
Contribution
It introduces the NL2VC-60 dataset and a tiered prompting approach with self-healing feedback, significantly improving formal verification success in open-weight LLMs.
Findings
Gemma 4-31B achieved 90.91% verification success.
GPT-OSS 120B improved from 0% to 81.82% success.
Structural signatures and iterative prompts greatly enhance verification outcomes.
Abstract
Large Language Models (LLMs) show promise in automated software engineering, yet their guarantee of correctness is frequently undermined by erroneous or hallucinated code. To enforce model honesty, formal verification requires LLMs to synthesize implementation logic alongside formal specifications that are subsequently proven correct by a mathematical verifier. However, the transition from informal natural language to precise formal specification remains an arduous task. Our work addresses this by providing the NaturalLanguage2VerifiedCode (NL2VC)-60 dataset: a collection of 60 complex algorithmic problems. We evaluate 11 randomly selected problem sets across seven open-weight LLMs using a tiered prompting strategy: contextless prompts, signature prompts providing structural anchors, and self-healing prompts utilizing iterative feedback from the Dafny verifier. To address vacuous…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
