Automated Proof Generation for Rust Code via Self-Evolution
Tianyu Chen, Shuai Lu, Shan Lu, Yeyun Gong, Chenyuan Yang, Xuheng Li, Md Rakib Hossain Misu, Hao Yu, Nan Duan, Peng Cheng, Fan Yang, Shuvendu K Lahiri, Tao Xie, Lidong Zhou

TL;DR
SAFE is a novel framework that automates proof generation for Rust code by leveraging self-evolution, data synthesis, and symbolic verification, significantly improving accuracy over existing models.
Contribution
It introduces a self-evolving cycle combining data synthesis, fine-tuning, and verification to enable automated proof generation for Rust code, overcoming data scarcity.
Findings
Achieves 52.52% accuracy on a human-crafted benchmark.
Outperforms GPT-4o in proof generation precision.
Demonstrates the effectiveness of self-debugging with synthesized incorrect proofs.
Abstract
Ensuring correctness is crucial for code generation. Formal verification offers a definitive assurance of correctness, but demands substantial human effort in proof construction and hence raises a pressing need for automation. The primary obstacle lies in the severe lack of data-there is much fewer proofs than code snippets for Large Language Models (LLMs) to train upon. In this paper, we introduce SAFE, a framework that overcomes the lack of human-written proofs to enable automated proof generation of Rust code. SAFE establishes a self-evolving cycle where data synthesis and fine-tuning collaborate to enhance the model capability, leveraging the definitive power of a symbolic verifier in telling correct proofs from incorrect ones. SAFE also re-purposes the large number of synthesized incorrect proofs to train the self-debugging capability of the fine-tuned models, empowering them to…
Peer Reviews
Decision·ICLR 2025 Poster
This paper studies automating proof generation in formal program verification with LLMs, an important direction with great potential for practical applications. The focus is on Rust, a relatively new language that is gaining widespread adoption. Although synthetic data generation for fine-tuning LLMs is not a completely novel idea, the paper introduces a few interesting techniques for the domain of proof generation for Rust. I particularly like the metric for filtering high-quality specification
The paper only focuses on small programs in the style of MBPP and CodeNet. Although I understand this is partly due to the limitation of the Verus tool, I do believe that the paper should present some case studies or discussion on how to scale the approach to real-world software projects. Apart from proof generation, a major part of formal verification is writing the specifications. The paper covers mechanisms to fine-tune a “good” specification generation. It would strengthen the paper if more
1. This paper proposes a novel approach that use self-evolving to iteratively improve LLM's ability of generating Rust specification and proofs. The fact that this approach does not rely on larger LLM such as GPT-4o in the following iterations (except the first round) makes it more generalizable and scalable. 2. The proposed approach shows great effectiveness, with three round of self-evolving, the fine-tuned LLM shows about 40% higher accuracy@1 compared to the prompting approach. 3. Comprehens
1. The self-debugging ability is shown to be only effective for the first time, what could be potential approach for improving the self-debugging ability in the following rounds? 2. I am wondering if this self-evolving approach can improve smaller LLMs ability. For instance, if the backbone is DeepSeekCoder-1.3B, how effective is the self-evolving approach?
- The paper is well-written and nicely structured. The figures and tables are well formatted and legible. - The story is engaging and the tackled topic highly relevant. - The results are clearly presented and provide interesting insights.
- In Table 1 the Accuracy of GPT-4o on VerusBench @100 is unfortunately missing (likely due to high inference cost?). Similarly the result of DeepSeekCoder RAW @100 is missing. If the authors could provide these values, the tables would provide a much more complete picture. - In Table 2, Round 3 appears to severely degrade performance of the resulting model on the Tutorial dataset. Does this constitute some first signs of overfitting or collapse or could the authors provide some more insight on
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsEvolutionary Algorithms and Applications · Cellular Automata and Applications · Algorithms and Data Compression
