Alchemy: Amplifying Theorem-Proving Capability through Symbolic Mutation
Shaonan Wu, Shuai Lu, Yeyun Gong, Nan Duan, Ping Wei

TL;DR
This paper introduces Alchemy, a framework that synthetically enlarges formal theorem datasets via symbolic mutation, significantly boosting neural theorem proving performance on benchmarks.
Contribution
Alchemy is a novel data synthesis method that increases theorem corpus size by an order of magnitude, enhancing neural theorem prover training and performance.
Findings
Theorem corpus increased from 110k to 6M using Alchemy.
Achieved 4.70% performance improvement on Leandojo benchmark.
Gained 2.47% performance boost on miniF2F out-of-distribution benchmark.
Abstract
Formal proofs are challenging to write even for experienced experts. Recent progress in Neural Theorem Proving (NTP) shows promise in expediting this process. However, the formal corpora available on the Internet are limited compared to the general text, posing a significant data scarcity challenge for NTP. To address this issue, this work proposes Alchemy, a general framework for data synthesis that constructs formal theorems through symbolic mutation. Specifically, for each candidate theorem in Mathlib, we identify all invocable theorems that can be used to rewrite or apply to it. Subsequently, we mutate the candidate theorem by replacing the corresponding term in the statement with its equivalent form or antecedent. As a result, our method increases the number of theorems in Mathlib by an order of magnitude, from 110k to 6M. Furthermore, we perform continual pretraining and…
Peer Reviews
Decision·ICLR 2025 Poster
Some originality: This is a good and new attempt for augmenting a human-written library (although similar ideas have been applied for "pure symbolic and from scratch" methods for generating theorems such as INT, HTPS-Equations and AlphaGeometry) Good quality: the paper is well written and all settings are of high relevance Good clarity: the paper is presented in a clear manner. The experimental setting and results are presented in a standard way and easy to follow Fair significance: The improvem
1. Poor improvement: although the improvement on pass rate is consistent, it's very limited: ranging from 0.62% to 4.7% on mathlib and only 2.47% on miniF2F (34.01% to 36.48%). This is pretty marginal in terms of improvement. 2. Narrow application possibility: the approach highly replies on a library of existing equivalence (or implying) theorems and their usage in proofs of other theorems.
Originality - The idea of synthesizing new theorem statements through rewrites and applies is new (as far as I'm aware). Quality - Aside from the concerns discussed below, the experiments were carried out well for several variants while adhering to standard benchmarks and search algorithm protocols. - Implementing modifications to data extraction in Lean is likely nontrivial. Clarity - The data synthesis methodology was explained clearly. Significance - Lack of data is widely regarded as a c
As mentioned in the strengths above, the general direction of augmenting data using symbolic techniques is interesting and under-explored. I have two primary concerns: (1) the experimental evaluation of the proposed techniques; (2) the data augmentation techniques explored in the current paper. ### Experimental evaluation 1. **Baselines**: the baseline method is a LM finetuned on (state, tactic) pairs from Mathlib. However, the proposed method does (i) continued pretraining and (ii) (state, ta
1. The approach is technically robust, with well-documented use of symbolic mutations (specifically rw and apply tactics) to ensure the correctness of new theorems by construction. The improvements seen on Leandojo and miniF2F benchmarks support the framework’s validity. 2. Alchemy significantly increases the number of available theorems in Mathlib, scaling up to 6 million theorems through systematic symbolic mutations. This large corpus helps address the issue of limited formal proof data for t
1. Marginal Gains in Benchmark Performance: Despite generating millions of new theorems, the gains in miniF2F accuracy are limited to 2.5%, notably lower than the >60% accuracy achieved by SOTA models such as DeepSeekProver and InternLM Prover. This modest improvement raises questions regarding the utility and quality of the synthetic theorems for real-world theorem-proving tasks. 2. Computational Cost: The process of generating and verifying theorems is highly resource-intensive. The implementa
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsLogic, programming, and type systems
