LLM-based Automated Theorem Proving Hinges on Scalable Synthetic Data Generation

Junyu Lai; Jiakun Zhang; Shuo Xu; Taolue Chen; Zihang Wang; Yao Yang; Jiarui Zhang; Chun Cao; Jingwei Xu

arXiv:2505.12031·cs.AI·May 20, 2025

LLM-based Automated Theorem Proving Hinges on Scalable Synthetic Data Generation

Junyu Lai, Jiakun Zhang, Shuo Xu, Taolue Chen, Zihang Wang, Yao Yang, Jiarui Zhang, Chun Cao, Jingwei Xu

PDF

Open Access 1 Repo 3 Reviews

TL;DR

This paper presents a novel approach to training large language models for automated theorem proving by generating diverse synthetic proof states, improving performance on benchmark datasets.

Contribution

Introduces a proof-state exploration method for synthetic data generation and an adaptive beam size strategy to enhance LLM-based theorem proving.

Findings

01

Achieves 60.74% Pass@1 on MiniF2F benchmark.

02

Attains 21.18% Pass@1 on ProofNet benchmark.

03

Demonstrates the effectiveness of large-scale synthetic data in theorem proving.

Abstract

Recent advancements in large language models (LLMs) have sparked considerable interest in automated theorem proving and a prominent line of research integrates stepwise LLM-based provers into tree search. In this paper, we introduce a novel proof-state exploration approach for training data synthesis, designed to produce diverse tactics across a wide range of intermediate proof states, thereby facilitating effective one-shot fine-tuning of LLM as the policy model. We also propose an adaptive beam size strategy, which effectively takes advantage of our data synthesis method and achieves a trade-off between exploration and exploitation during tree search. Evaluations on the MiniF2F and ProofNet benchmarks demonstrate that our method outperforms strong baselines under the stringent Pass@1 metric, attaining an average pass rate of $60.74%$ on MiniF2F and $21.18%$ on ProofNet. These…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01Rating 4Confidence 3

Strengths

- Well written and easy to follow; the illustrative figures and sampled examples are particularly helpful. - Code is easy to access, and the experiments are reproducible within an academic compute budget.

Weaknesses

- As a data synthesis strategy, I don’t think it has been critically evaluated: How well does the synthetic data perform under standard search techniques (e.g., greedy decoding)? How much does the exploration component actually contribute to the final proof success rate? An ablation study would help clarify this. - It would be helpful to include an algorithmic comparison between the newly proposed tree-search method and existing approaches, rather than only reporting raw performance. It would al

Reviewer 02Rating 6Confidence 3

Strengths

1. The proposed proof state exploration pipeline provides a well-structured and scalable solution for synthetic data generation and the idea of decoupling tactic exploration from premise generation and forcing low-probability tactic sampling is well motivated 2. The experimental results are sound, with detailed comparisons, ablation on beam size, and transparent computational settings

Weaknesses

While the method is conceptually strong, it remains largely heuristic. 1. The pruning and beam-decay hyperparameters are fixed ad hoc without sensitivity analysis or further ablations 2. The methodology section is long and very formalized, but lacks intuition for key ideas like why constrained decoding enhances tactic diversity or how pruning parameters (α, β, γ) are chosen and affect synthesis efficiency 3. The scoring function used in tree search still relies on local log-probabilities in Eq.

Reviewer 03Rating 6Confidence 4

Strengths

1. The paper formulates a well-structured pipeline for data synthesis via proof-state exploration, bridging self-improvement and verifiable proof checking in Lean 4. 2. The reported 60.74 % Pass@1 on MiniF2F is competitive with or better than recent specialized theorem-proving LLMs, suggesting tangible benefits from the proposed training pipeline. 3. The proposed adaptive beam-size strategy which dynamically shrinking the beam with depth is a simple yet effective heuristic that mitigates local-t

Weaknesses

My main concern is that the paper does not report the raw (zero-shot) performance of Qwen2.5-Math-7B on MiniF2F or ProofNet. As a result, it is unclear how much improvement stems from the exploration-based fine-tuning versus the inherent strength of the base model. For fair comparison, the authors should either (a) report base model results, or (b) fine-tune the same base models used by prior baselines (e.g., BFSProver) under the same training pipeline.

Code & Models

Repositories

njudeepengine/llm_based_atp
noneOfficial

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsLogic, programming, and type systems · Mathematics, Computing, and Information Processing · Machine Learning in Materials Science