LEAN-GitHub: Compiling GitHub LEAN repositories for a versatile LEAN prover
Zijian Wu, Jiayu Wang, Dahua Lin, Kai Chen

TL;DR
This paper introduces LEAN-GitHub, a large-scale formal dataset from Lean 4 repositories, which, when used to fine-tune a language model, significantly improves formal mathematical reasoning performance across multiple benchmarks.
Contribution
The paper presents LEAN-GitHub, a novel dataset from Lean 4 repositories, and demonstrates its effectiveness in enhancing formal reasoning models beyond existing methods.
Findings
Model achieved 48.8% accuracy with one pass on miniF2F.
Model achieved 54.5% accuracy with 64 passes on miniF2F.
State-of-the-art results on ProofNet and Putnam benchmarks.
Abstract
Recently, large language models have presented promising results in aiding formal mathematical reasoning. However, their performance is restricted due to the scarcity of formal theorem-proving data, which requires additional effort to be extracted from raw formal language corpora. Meanwhile, a significant amount of human-written formal language corpora remains underutilized. To address this issue, we propose LEAN-GitHub, a dataset consisting of large-scale formal data extracted from almost all Lean 4 repositories on GitHub. After fine-tuning InternLM-math-plus on this dataset, our model achieved accuracies of 48.8% with a single pass and 54.5% with 64 passes on the Lean 4 miniF2F test, surpassing state-of-the-art method at 52%. And it also achieves state-of-the-art on two other Lean 4 benchmarks (ProofNet and Putnam) targeting different fields/levels of math. These results demonstrate…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
