Process-Driven Autoformalization in Lean 4

Jianqiao Lu; Yingjia Wan; Zhengying Liu; Yinya Huang; Jing Xiong,; Chengwu Liu; Jianhao Shen; Hui Jin; Jipeng Zhang; Haiming Wang; Zhicheng; Yang; Jing Tang; Zhijiang Guo

arXiv:2406.01940·cs.CL·October 15, 2024

Process-Driven Autoformalization in Lean 4

Jianqiao Lu, Yingjia Wan, Zhengying Liu, Yinya Huang, Jing Xiong,, Chengwu Liu, Jianhao Shen, Hui Jin, Jipeng Zhang, Haiming Wang, Zhicheng, Yang, Jing Tang, Zhijiang Guo

PDF

Open Access 2 Repos 3 Reviews

TL;DR

This paper introduces a new benchmark and a process-supervised verifier to improve autoformalization of natural language mathematics into Lean 4, demonstrating enhanced accuracy and efficiency in formalization tasks.

Contribution

It presents a novel benchmark for autoformalization in Lean 4 and a process-supervised verifier that leverages compiler feedback to improve large language model performance.

Findings

01

PSV improves autoformalization accuracy

02

Fine-tuning with process data yields better results

03

Benchmark enables comprehensive evaluation of autoformalization

Abstract

Autoformalization, the conversion of natural language mathematics into formal languages, offers significant potential for advancing mathematical reasoning. However, existing efforts are limited to formal languages with substantial online corpora and struggle to keep pace with rapidly evolving languages like Lean 4. To bridge this gap, we propose a new benchmark \textbf{Form}alization for \textbf{L}ean~\textbf{4} (\textbf{\name}) designed to evaluate the autoformalization capabilities of large language models (LLMs). This benchmark encompasses a comprehensive assessment of questions, answers, formal statements, and proofs. Additionally, we introduce a \textbf{P}rocess-\textbf{S}upervised \textbf{V}erifier (\textbf{PSV}) model that leverages the precise feedback from Lean 4 compilers to enhance autoformalization. Our experiments demonstrate that the PSV method improves autoformalization,…

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01Rating 5Confidence 3

Strengths

The introduced dataset would be useful for LLM-based auto formalization research - The evaluation is comprehensive and uses human evaluation for autoformalizer performance

Weaknesses

- Line 103: One of the contributions states that - “We propose a process-driven framework PDA that leverages formal languages to provide process feedback on reasoning, enhancing the autoformalization capabilities of LLMs.” This statement is unclear. The PDA framework is specific to Lean4 and the technique or evaluation doesn’t really convince me that the technique would enhance LLMs ability to autoformalize for any formal language. - Line 176: “The real test set is constructed by collecting

Reviewer 02Rating 5Confidence 3

Strengths

- Autoformalization is an important problem, and there is not yet a good benchmark in the literature measuring its success, so the problem this paper tackles is important and relevant - Incorporating execution feedback during autoformalization is a new idea - The method presented in the paper shows strong performance compared to off-the-shelf models. - The benchmark is curated with a combination of automation and manual checking.

Weaknesses

- Because the dataset was constructed from LLM-based informalization, I am not convinced of the quality of the benchmark. I read Appendix F, and while the authors do run a human manual check, only a small number of samples seem to be checked. In addition, the human success rate found was only 72%, which does not seem sufficient when the samples are being used for a benchmark. - The PDA framework uses compiler feedback as a signal for correctness, while the final evaluation is whether the generat

Reviewer 03Rating 1Confidence 4

Strengths

1. Originality - The core idea of using compiler feedback in a systematic way to improve autoformalization is valuable and is implemented in a novel way - The attempt to create a comprehensive dataset for Lean 4 that includes both formal and informal pairs of both statements and proofs addresses an important gap in the field - The process-driven approach to verification represents a new direction in autoformalization 2. Quality - The methodology for preventing formal terms from a

Weaknesses

The strengths above are significantly undermined by: - Systematic and apparently hyperbolic misrepresentation of dataset quality - The lack of proper baselines and controls - Missing adequate qualitative analysis of failure modes - Serious methodological flaws in evaluation - Potentially misleading reporting practices Significant concerns in detail: - L166—167: Is any effort made to ensure that we don’t have pairs of lemmas that are essentially the same (for example, one proving `x=y` and ano

Code & Models

Repositories

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsFlexible and Reconfigurable Manufacturing Systems · Product Development and Customization · Collaboration in agile enterprises