Goedel-Prover-V2: Scaling Formal Theorem Proving with Scaffolded Data Synthesis and Self-Correction
Yong Lin, Shange Tang, Bohan Lyu, Ziran Yang, Jui-Hui Chung, Haoyu Zhao, Lai Jiang, Yihan Geng, Jiawei Ge, Jingruo Sun, Jiayun Wu, Jiri Gesi, Ximing Lu, David Acuna, Kaiyu Yang, Hongzhou Lin, Yejin Choi, Danqi Chen, Sanjeev Arora, Chi Jin

TL;DR
Goedel-Prover-V2 introduces innovative training techniques like scaffolded data synthesis and self-correction, achieving state-of-the-art performance in automated theorem proving with smaller models and open-source accessibility.
Contribution
The paper presents a new open-source theorem prover with scaffolded data, self-correction, and model averaging, significantly improving performance over prior models.
Findings
Achieves 88.1% pass@32 on MiniF2F with 32B model
Solves 86 problems on PutnamBench, first among open-source models
Outperforms larger models like DeepSeek-Prover-V2-671B in accuracy
Abstract
We introduce Goedel-Prover-V2, a series of open-source language models that set a new state-of-the-art in automated theorem proving. Built on the standard expert iteration and reinforcement learning pipeline, our approach incorporates three key innovations: (1) Scaffolded data synthesis: We generate synthetic tasks of increasing difficulty to train the model to master increasingly complex theorems; (2) Verifier-guided self-correction: We enable the model to iteratively revise its proofs by leveraging feedback from the Lean compiler; (3) Model averaging: We merge model checkpoints to mitigate the decrease in model output diversity in later stages of training. Our small model, Goedel-Prover-V2-8B, reaches 84.6% pass@32 on MiniF2F and outperforms DeepSeek-Prover-V2-671B under the same metric, despite being 80X smaller. Our flagship model, Goedel-Prover-V2-32B, achieves 88.1% on MiniF2F at…
Peer Reviews
Decision·ICLR 2026 Poster
1. Goedel-Prover-V2 achieves impressive performance, with the 32B model achieving 88.1% under pass@32 on MiniF2F-Test and solving 86 problems on PutnamBench at pass@184. The smaller 8B model also outperforms the larger previous model on MiniF2F. 2. The experiment section is thorough and complete, covering not only core benchmark evaluations but also scaling laws, ablation studies on verifier-guided self-correction, and analysis of model averaging and reinforcement learning dynamics. This allows
Despite being a thorough and complete work, I believe the paper could benefit from improvements in the following areas: 1. The evaluated datasets primarily focus on domains where the underlying knowledge is relatively elementary but requires the application of many tricky techniques, for example, the PutnamBench. As a result, the model may achieve strong performance by simply memorizing those techniques from larger models, rather than demonstrating true improvements in cognitive reasoning. It w
This paper has the potential to be a high impact paper, in my view, if it releases the code for its training method, and releases its training set. If these two items are not released, my recommendation for the paper will be rejection. The training algorithm, especially the model averaging method, is a useful contribution, assuming that code will be released with the paper. The release of the ATP model is a useful and significant contribution for the community. However, the mere release of the
Accuracy of the autoformalizer model is only evaluated on 300 Omni math problems. First, it is not clear what these problems are. Second, it is not clear how the 300 problems relate to the training set of this model. Paper should report the accuracy of its model on known benchmarks such as miniF2F. If the model is trained on the miniF2F, paper should report that to the reader. Evaluating a new model on a new dataset does not provide sufficient information. How is the accuracy of autoformalizer
This paper introduces a novel method for generating synthetic data, potentially one of the strongest reasons. The model filters out bad autoformalisations, and integrates compiler-feedback self-repair directly into whole-proof generation. The model uses `extract_goal` to capture unsolved states from a proof attempt too. The model introduces an interesting method for improving test-time exploration through checkpoint averaging, potentially balancing accuracy and diversity. Last but not least, the
1. The model primarily focuses on MiniF2F and PutnamBench. While these are widely used, including more datasets in the comparison would be better, including ProofNet-test, AIME, AMC, IMO, and DeepSeek-ProverBench. 2. The model averaging improves pass@n but introduces another hyperparameter without a principled selection rule. A more detailed explanation of this would be great. 3. The data-synthesis filter uses LLM autoformalisation and judging; is there any manual annotation on the correctness
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
