DeepTheorem: Advancing LLM Reasoning for Theorem Proving Through Natural Language and Reinforcement Learning
Ziyin Zhang, Jiahao Xu, Zhiwei He, Tian Liang, Qiuzhi Liu, Yansi Li, Linfeng Song, Zhenwen Liang, Zhuosheng Zhang, Rui Wang, Zhaopeng Tu, Haitao Mi, Dong Yu

TL;DR
DeepTheorem introduces a large-scale informal theorem-proving framework using natural language and reinforcement learning to significantly enhance LLM reasoning capabilities in mathematical theorem proving.
Contribution
It presents a new benchmark dataset, a novel RL strategy (RL-Zero), and comprehensive evaluation metrics to improve LLM performance in informal theorem proving.
Findings
DeepTheorem achieves state-of-the-art accuracy in informal theorem proving.
The dataset covers 121K IMO-level theorems across diverse domains.
Reinforcement learning with RL-Zero improves reasoning quality.
Abstract
Theorem proving serves as a major testbed for evaluating complex reasoning abilities in large language models (LLMs). However, traditional automated theorem proving (ATP) approaches rely heavily on formal proof systems that poorly align with LLMs' strength derived from informal, natural language knowledge acquired during pre-training. In this work, we propose DeepTheorem, a comprehensive informal theorem-proving framework exploiting natural language to enhance LLM mathematical reasoning. DeepTheorem includes a large-scale benchmark dataset consisting of 121K high-quality IMO-level informal theorems and proofs spanning diverse mathematical domains, rigorously annotated for correctness, difficulty, and topic categories, accompanied by systematically constructed verifiable theorem variants. We devise a novel reinforcement learning strategy (RL-Zero) explicitly tailored to informal theorem…
Peer Reviews
Decision·Submitted to ICLR 2026
- Comprehensive experiments across multiple representative model scales. - Introduction of a large, well-structured dataset for informal theorem proving. - Innovative adaptation of RL-Zero to the informal theorem-proving setting.
- Many quality-control steps rely heavily on LLM judgments (e.g., contamination justification and difficulty annotation). A human evaluation of a sampled subset would strengthen confidence in the dataset’s integrity and annotation accuracy. - The process evaluation methodology remains somewhat unclear (even with the prompt shown in the appendix). Compared to formal proof verification (e.g., in Lean), LLM-based evaluation is inherently less reliable. A more detailed comparison against prior proce
1. **Novel idea of verifying NL proofs using LLM:** The paper proposes a logically sound verification method to evaluate natural language theorem proving by following its 4-step criteria. The idea to prove/disprove the variants of the problem and combine all the results together to justify the correctness of the theorem is novel to the field. 2. **Leading benchmark results on open-source models:** The DeepTheorem dataset trained model achieves leading performance in open-source models and can ev
Despite the above strengths, I have a few concerns regarding this paper: 1. **Unclear motivation and positioning:** This paper’s narrative is confusing regarding its claim of formal theorem proving. While this work focuses on informal theorem proving, the major comparison datasets shown in Figure 1 are DS-Prover-V1 and Lean-WB, both of which are based on Lean4 formal proving rather than informal reasoning. Similarly, the experiment section presents results from DS-Prover, which is not designed
* The paper is overall well-written and easy to follow, with an interesting focus on informal theorem proving. * The proposed dataset is large-scale and carefully annotated with detailed information such as problem variants and difficulty levels. * The LLM trained on this dataset achieves better performance than other baselines of the same model scale.
I think the major weakness of this paper lies in the fact that the entire dataset construction and evaluation pipeline is based solely on LLMs. * First, the original problem statements are drawn from existing datasets, which may have been included in the pretraining data of some LLMs. This raises concerns about potential contamination and data leakage. * Second, the generation of variants, the labeling of True or False, the construction of proofs, and even the assignment of difficulty levels a
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsSoftware Engineering Research · Multi-Agent Systems and Negotiation · Logic, programming, and type systems
MethodsALIGN
