Re:Form -- Reducing Human Priors in Scalable Formal Software Verification with RL in LLMs: A Preliminary Study on Dafny

Chuanhao Yan; Fengdi Che; Xuhan Huang; Xu Xu; Xin Li; Yizhi Li; Xingwei Qu; Jingzhe Shi; Chenghua Lin; Yaodong Yang; Binhang Yuan; Hang Zhao; Yu Qiao; Bowen Zhou; Jie Fu

arXiv:2507.16331·cs.CL·October 14, 2025

Re:Form -- Reducing Human Priors in Scalable Formal Software Verification with RL in LLMs: A Preliminary Study on Dafny

Chuanhao Yan, Fengdi Che, Xuhan Huang, Xu Xu, Xin Li, Yizhi Li, Xingwei Qu, Jingzhe Shi, Chenghua Lin, Yaodong Yang, Binhang Yuan, Hang Zhao, Yu Qiao, Bowen Zhou, Jie Fu

PDF

Open Access 6 Models 2 Datasets

TL;DR

This paper explores reducing human priors in formal software verification by leveraging reinforcement learning and formal languages like Dafny, enabling scalable, verifiable reasoning in LLMs without extensive human annotations.

Contribution

It introduces a scalable data curation pipeline, a formal benchmark DafnyComp, and demonstrates improved LLM performance in generating verifiable Dafny code with less reliance on human priors.

Findings

01

Supervised fine-tuning enables small models to generate verifiable Dafny code.

02

RL with regularization enhances out-of-domain generalization.

03

Models outperform proprietary baselines on DafnyComp benchmark.

Abstract

Existing informal language-based (e.g., human language) Large Language Models (LLMs) trained with Reinforcement Learning (RL) face a significant challenge: their verification processes, which provide crucial training signals, are neither reliable nor scalable. In fact, the prevalent large proprietary models could hardly generate verifiable programs. A promising yet largely uncharted alternative is formal language-based reasoning. Grounding LLMs in rigorous formal systems where generative models operate in formal language spaces (e.g., Dafny) enables the automatic and mathematically provable verification of their reasoning processes and outcomes. This capability is pivotal for achieving large-scale, reliable formal software verification. It is a common practice to employ human-annotated chain-of-thought and other human priors to induce the reasoning and coding capabilities of LLMs.…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Models

Datasets

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsMachine Learning and Algorithms · Machine Learning and Data Classification · Software Engineering Research