Herald: A Natural Language Annotated Lean 4 Dataset
Guoxiong Gao, Yutong Wang, Jiedong Jiang, Qi Gao, Zihan Qin, Tianyi, Xu, Bin Dong

TL;DR
This paper introduces Herald, a dataset and translation framework that align natural language with Lean 4 formal proofs, significantly improving automated reasoning capabilities in mathematical language models.
Contribution
It presents a novel dataset and translation system for formal language Lean 4, enabling better training of language models for mathematical reasoning tasks.
Findings
Herald translator achieves 93.2% accuracy on miniF2F-test
Outperforms existing models like InternLM2-Math-Plus-7B and TheoremLlama
Successfully translated a section of graduate-level mathematical literature
Abstract
Verifiable formal languages like Lean have profoundly impacted mathematical reasoning, particularly through the use of large language models (LLMs) for automated reasoning. A significant challenge in training LLMs for these formal languages is the lack of parallel datasets that align natural language with formal language proofs. To address this challenge, this paper introduces a novel framework for translating the Mathlib4 corpus (a unified library of mathematics in formal language Lean 4) into natural language. Building upon this, we employ a dual augmentation strategy that combines tactic-based and informal-based approaches, leveraging the Lean-jixia system, a Lean 4 analyzer. We present the results of this pipeline on Mathlib4 as Herald (Hierarchy and Retrieval-based Translated Lean Dataset). We also propose the Herald Translator, which is fine-tuned on Herald. Herald translator…
Peer Reviews
Decision·ICLR 2025 Poster
1. The authors provide a clear rationale for their methods and promise to open-source the whole pipeline for data generation. This is helpful for future research in AI-assisted math. 2. The evaluation results look impressive, and they are beyond standard benchmarks. The authors demonstrated Herald translator's effectiveness on a real-world formalization project.
1. Lack of ablation studies. It's not clear to me which parts of the data generation/augmentation are helpful. More ablation experiments would help, especially for the multilinguistic augmentation. I'm really curious if that would result in NL statements in different languages. 2. Does DeepSeek-Prover-1.5-Instruct perform well on formalization? If so, it would be interesting to see how herald translator compares with it. 3. Contamination: Could the data augmentation/curation process introduce co
- The proposed augmentation pipeline is well-designed. In particular, it's a great idea to use the results of structural analysis to order statement translation within proofs to allow informalization to condition on both formal and natural language versions of a statement's logical context. Herald definitely seems to me to be a step forward in terms of the quality of augmented examples. - While many of the comparisons this paper makes between Herald and prior work are qualitative, I found the d
- It's hard for me to get a sense of the significance of the fine-tuning results, given that both the chosen metric and two of the three datasets used for evaluation are constructed by the authors. It would be helpful to see some representative examples of the two custom test sets, in addition to samples from the model that pass and fail each validation step. - I also feel that it's essential to audit the LM NLI check and report its correlation with human expert judgements of semantic equivalen
- The paper extends the boundaries of autoformalization using a Retrieval-Augmented Generation (RAG) based method. - The paper develops an innovative, practical tool that helps extend its reach to users beyond the community. - The paper's method is simple, and the writing is well-done.
- I believe that a contextual learning approach was used in the translation process. There should be a detailed investigation into how different retrieval methods affect it, as they directly impact the accuracy of the model's output. - Why not compare it with other autoformalization methods, such as [1,2]. **Reference** [1] FVEL: Interactive Formal Verification Environment with Large Language Models via Theorem Proving. [2] FormalAlign: Automated Alignment Evaluation for Autoformalization
Videos
Taxonomy
TopicsBig Data and Business Intelligence · Flexible and Reconfigurable Manufacturing Systems · Advanced Manufacturing and Logistics Optimization
MethodsLib · ALIGN
