The Open Proof Corpus: A Large-Scale Study of LLM-Generated Mathematical Proofs
Jasper Dekoninck, Ivo Petrov, Kristian Minchev, Mislav Balunovic, Martin Vechev, Miroslav Marinov, Maria Drencheva, Lyuba Konova, Milen Shumanov, Kaloyan Tsvetkov, Nikolay Drenchev, Lazar Todorov, Kalina Nikolova, Nikolay Georgiev, Vanesa Kalinkova, Margulan Ismoldayev

TL;DR
The paper introduces the Open Proof Corpus, a large dataset of over 5,000 human-evaluated proofs generated by LLMs, enabling analysis and improvement of automated mathematical proof generation.
Contribution
It provides the first large-scale, high-quality dataset of LLM-generated proofs, including solutions to prestigious math competition problems, and demonstrates its utility through model fine-tuning.
Findings
Identifies performance gaps between natural language and formal proofs.
Highlights discrepancy between answer accuracy and proof validity.
Shows improved proof evaluation by fine-tuning on the dataset.
Abstract
In recent months, large language models (LLMs) have made significant progress in mathematical proof generation, but further advancement is hindered by the lack of a large-scale, high-quality dataset of human-evaluated proofs. While expensive to create, such a dataset is essential for driving improvements in training and enabling a rigorous analysis of proof generation capabilities. In this work, we present the Open Proof Corpus (OPC), a dataset comprising over 5,000 human-evaluated proofs produced by state-of-the-art LLMs. The OPC was specifically designed for broad applicability and downstream usage in proof generation research and is the first to include a substantial number of correct, LLM-generated solutions to problems from prestigious mathematics competitions such as the USAMO and IMO. Using the OPC, we explore critical questions in automated proof generation: (1) the performance…
Peer Reviews
Decision·ICLR 2026 Poster
The problem assessment pipeline is rigorous. The single fine-tuned model on OPC is a welcome addition that supports OPC. The dataset size is sufficiently large to allow finetuning.
- misleading statement: my biggest issue is that apparently not the full dataset was shared, as only USAMO and BMOSL seem to appear in the zipped supplementary (and also for these, not the full dataset? seems quite small), contrary to the claim: "We have included our dataset in the supplementary material, along with detailed descriptions of our methodology and experimental setup to ensure full reproducibility." - ambiguous claim: "The OPC was specifically designed for broad applicability and do
- Addresses critical need: First large-scale dataset of human-evaluated LLM proofs, filling a major gap since existing benchmarks focus only on final answers (e.g., AIME, HMMT). - Methodology: Well-designed grading pipeline using 13 former IMO participants, clear guidelines, double-grading (90.4% agreement), and clever use of LLM-generated issue summaries to aid grading efficiency. - Significant empirical findings: The 4x gap between informal/formal proof generation and the divergence between an
- Binary evaluation loses information: The "5+/7 points counts as correct" threshold is arbitrary and discards nuanced quality differences that partial credit scoring would capture. - Unfair formal/informal comparison: Comparing specialized formal proof models against general-purpose LLMs isn't apples-to-apples. The brief mention of Seed-Prover's 50% formal accuracy suggests the gap may be overstated. - Missing statistical analysis: No confidence intervals, significance tests, or error bars desp
1. A fairly large dataset of math problems is proposed, with its core value lying in expert annotations. Furthermore, both correct and incorrect proofs are included. It is quite unusual for modern math benchmarks and datasets to contain such annotations, making this contribution valuable to the community. 2. Some exciting empirical insights are provided. While I do not feel that the exploration of performance differences between natural and formal proof generation is particularly valuable, since
It might be somewhat subjective, but the presentation quality is low, even considering the number of figures included in the manuscript. While it is understandable that the authors tend to include more content rather than placing all figures in the Appendix, the text in each section is highly granular. I think this weakness could be addressed by rethinking the overall structure. The main focus should be on the dataset itself, which is valuable, while the incremental contributions in the form of
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsLogic, programming, and type systems · Mathematics, Computing, and Information Processing · Polynomial and algebraic computation
