Alchemy: Amplifying Theorem-Proving Capability through Symbolic Mutation

Shaonan Wu; Shuai Lu; Yeyun Gong; Nan Duan; Ping Wei

arXiv:2410.15748·cs.AI·April 4, 2025

Alchemy: Amplifying Theorem-Proving Capability through Symbolic Mutation

Shaonan Wu, Shuai Lu, Yeyun Gong, Nan Duan, Ping Wei

PDF

Open Access 1 Repo 3 Reviews

TL;DR

This paper introduces Alchemy, a framework that synthetically enlarges formal theorem datasets via symbolic mutation, significantly boosting neural theorem proving performance on benchmarks.

Contribution

Alchemy is a novel data synthesis method that increases theorem corpus size by an order of magnitude, enhancing neural theorem prover training and performance.

Findings

01

Theorem corpus increased from 110k to 6M using Alchemy.

02

Achieved 4.70% performance improvement on Leandojo benchmark.

03

Gained 2.47% performance boost on miniF2F out-of-distribution benchmark.

Abstract

Formal proofs are challenging to write even for experienced experts. Recent progress in Neural Theorem Proving (NTP) shows promise in expediting this process. However, the formal corpora available on the Internet are limited compared to the general text, posing a significant data scarcity challenge for NTP. To address this issue, this work proposes Alchemy, a general framework for data synthesis that constructs formal theorems through symbolic mutation. Specifically, for each candidate theorem in Mathlib, we identify all invocable theorems that can be used to rewrite or apply to it. Subsequently, we mutate the candidate theorem by replacing the corresponding term in the statement with its equivalent form or antecedent. As a result, our method increases the number of theorems in Mathlib by an order of magnitude, from 110k to 6M. Furthermore, we perform continual pretraining and…

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01Rating 6Confidence 5

Strengths

Some originality: This is a good and new attempt for augmenting a human-written library (although similar ideas have been applied for "pure symbolic and from scratch" methods for generating theorems such as INT, HTPS-Equations and AlphaGeometry) Good quality: the paper is well written and all settings are of high relevance Good clarity: the paper is presented in a clear manner. The experimental setting and results are presented in a standard way and easy to follow Fair significance: The improvem

Weaknesses

1. Poor improvement: although the improvement on pass rate is consistent, it's very limited: ranging from 0.62% to 4.7% on mathlib and only 2.47% on miniF2F (34.01% to 36.48%). This is pretty marginal in terms of improvement. 2. Narrow application possibility: the approach highly replies on a library of existing equivalence (or implying) theorems and their usage in proofs of other theorems.

Reviewer 02Rating 6Confidence 4

Strengths

Originality - The idea of synthesizing new theorem statements through rewrites and applies is new (as far as I'm aware). Quality - Aside from the concerns discussed below, the experiments were carried out well for several variants while adhering to standard benchmarks and search algorithm protocols. - Implementing modifications to data extraction in Lean is likely nontrivial. Clarity - The data synthesis methodology was explained clearly. Significance - Lack of data is widely regarded as a c

Weaknesses

As mentioned in the strengths above, the general direction of augmenting data using symbolic techniques is interesting and under-explored. I have two primary concerns: (1) the experimental evaluation of the proposed techniques; (2) the data augmentation techniques explored in the current paper. ### Experimental evaluation 1. **Baselines**: the baseline method is a LM finetuned on (state, tactic) pairs from Mathlib. However, the proposed method does (i) continued pretraining and (ii) (state, ta

Reviewer 03Rating 6Confidence 4

Strengths

1. The approach is technically robust, with well-documented use of symbolic mutations (specifically rw and apply tactics) to ensure the correctness of new theorems by construction. The improvements seen on Leandojo and miniF2F benchmarks support the framework’s validity. 2. Alchemy significantly increases the number of available theorems in Mathlib, scaling up to 6 million theorems through systematic symbolic mutations. This large corpus helps address the issue of limited formal proof data for t

Weaknesses

1. Marginal Gains in Benchmark Performance: Despite generating millions of new theorems, the gains in miniF2F accuracy are limited to 2.5%, notably lower than the >60% accuracy achieved by SOTA models such as DeepSeekProver and InternLM Prover. This modest improvement raises questions regarding the utility and quality of the synthetic theorems for real-world theorem-proving tasks. 2. Computational Cost: The process of generating and verifying theorems is highly resource-intensive. The implementa

Code & Models

Repositories

wclsn/Alchemy
noneOfficial

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsLogic, programming, and type systems