Process-Driven Autoformalization in Lean 4
Jianqiao Lu, Yingjia Wan, Zhengying Liu, Yinya Huang, Jing Xiong,, Chengwu Liu, Jianhao Shen, Hui Jin, Jipeng Zhang, Haiming Wang, Zhicheng, Yang, Jing Tang, Zhijiang Guo

TL;DR
This paper introduces a new benchmark and a process-supervised verifier to improve autoformalization of natural language mathematics into Lean 4, demonstrating enhanced accuracy and efficiency in formalization tasks.
Contribution
It presents a novel benchmark for autoformalization in Lean 4 and a process-supervised verifier that leverages compiler feedback to improve large language model performance.
Findings
PSV improves autoformalization accuracy
Fine-tuning with process data yields better results
Benchmark enables comprehensive evaluation of autoformalization
Abstract
Autoformalization, the conversion of natural language mathematics into formal languages, offers significant potential for advancing mathematical reasoning. However, existing efforts are limited to formal languages with substantial online corpora and struggle to keep pace with rapidly evolving languages like Lean 4. To bridge this gap, we propose a new benchmark \textbf{Form}alization for \textbf{L}ean~\textbf{4} (\textbf{\name}) designed to evaluate the autoformalization capabilities of large language models (LLMs). This benchmark encompasses a comprehensive assessment of questions, answers, formal statements, and proofs. Additionally, we introduce a \textbf{P}rocess-\textbf{S}upervised \textbf{V}erifier (\textbf{PSV}) model that leverages the precise feedback from Lean 4 compilers to enhance autoformalization. Our experiments demonstrate that the PSV method improves autoformalization,…
Peer Reviews
Decision·Submitted to ICLR 2025
The introduced dataset would be useful for LLM-based auto formalization research - The evaluation is comprehensive and uses human evaluation for autoformalizer performance
- Line 103: One of the contributions states that - “We propose a process-driven framework PDA that leverages formal languages to provide process feedback on reasoning, enhancing the autoformalization capabilities of LLMs.” This statement is unclear. The PDA framework is specific to Lean4 and the technique or evaluation doesn’t really convince me that the technique would enhance LLMs ability to autoformalize for any formal language. - Line 176: “The real test set is constructed by collecting
- Autoformalization is an important problem, and there is not yet a good benchmark in the literature measuring its success, so the problem this paper tackles is important and relevant - Incorporating execution feedback during autoformalization is a new idea - The method presented in the paper shows strong performance compared to off-the-shelf models. - The benchmark is curated with a combination of automation and manual checking.
- Because the dataset was constructed from LLM-based informalization, I am not convinced of the quality of the benchmark. I read Appendix F, and while the authors do run a human manual check, only a small number of samples seem to be checked. In addition, the human success rate found was only 72%, which does not seem sufficient when the samples are being used for a benchmark. - The PDA framework uses compiler feedback as a signal for correctness, while the final evaluation is whether the generat
1. Originality - The core idea of using compiler feedback in a systematic way to improve autoformalization is valuable and is implemented in a novel way - The attempt to create a comprehensive dataset for Lean 4 that includes both formal and informal pairs of both statements and proofs addresses an important gap in the field - The process-driven approach to verification represents a new direction in autoformalization 2. Quality - The methodology for preventing formal terms from a
The strengths above are significantly undermined by: - Systematic and apparently hyperbolic misrepresentation of dataset quality - The lack of proper baselines and controls - Missing adequate qualitative analysis of failure modes - Serious methodological flaws in evaluation - Potentially misleading reporting practices Significant concerns in detail: - L166—167: Is any effort made to ensure that we don’t have pairs of lemmas that are essentially the same (for example, one proving `x=y` and ano
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsFlexible and Reconfigurable Manufacturing Systems · Product Development and Customization · Collaboration in agile enterprises
