SubgoalXL: Subgoal-based Expert Learning for Theorem Proving
Xueliang Zhao, Lin Zheng, Haige Bo, Changran Hu, Urmish Thakker,, Lingpeng Kong

TL;DR
SubgoalXL introduces a subgoal-based expert learning framework that significantly improves the reasoning and proof capabilities of large language models in formal theorem proving, achieving state-of-the-art results on the miniF2F dataset.
Contribution
It presents a novel subgoal-oriented proof strategy combined with expert learning to enhance data efficiency and reasoning in LLMs for formal theorem proving.
Findings
Achieves 56.1% accuracy on miniF2F in Isabelle
Solves 41 AMC12, 9 AIME, and 3 IMO problems from miniF2F
Outperforms previous methods with a 4.9% absolute improvement
Abstract
Formal theorem proving, a field at the intersection of mathematics and computer science, has seen renewed interest with advancements in large language models (LLMs). This paper introduces SubgoalXL, a novel approach that synergizes subgoal-based proofs with expert learning to enhance LLMs' capabilities in formal theorem proving within the Isabelle environment. SubgoalXL addresses two critical challenges: the scarcity of specialized mathematics and theorem-proving data, and the need for improved multi-step reasoning abilities in LLMs. By optimizing data efficiency and employing subgoal-level supervision, SubgoalXL extracts richer information from limited human-generated proofs. The framework integrates subgoal-oriented proof strategies with an expert learning system, iteratively refining formal statement, proof, and subgoal generators. Leveraging the Isabelle environment's advantages in…
Peer Reviews
Decision·ICLR 2025 Conference Withdrawn Submission
The paper tackles a challenging and interesting problem of ML-guided formal theorem proving. It presents a new idea of applying an intermediate step of generating informal "subgoal proofs" before generating the formal proof. The evaluation on miniF2F shows substantial improvement compared to the other approaches. The authors release the code of SubgoalXL (although I didn't test it).
The author do not analyze the inference-time compute in the evaluation. For instance, you write "[a]t each iteration, we trained 4 formal proof generators". Are they all used when evaluating on miniF2F? This makes the inference cost potentially much higher than in the other approaches. It is not described how you evaluate on miniF2F given the trained three components (see questions below). The authors did not disclose the datasets for finetuning the components of SubgoalXL, which makes imposs
+ This paper introduces a novel method to address multi-step reasoning in interactive theorem proving by utilizing LLMs to generate subgoals. This approach is particularly valuable for scaling ITP by automating the decomposition of formal reasoning. + By generating informal-formal pairs of statements and proofs for subgoals, this work contributes to a richer dataset for formal theorem proving. This has implications beyond this paper, as it could aid the broader community in training more robust
- Some evaluation results require additional clarification. In Table 2, the SubgoalXL model outperforms the -subgoal model by 11.5% on the miniF2F-valid dataset but only by 2.8% on the miniF2F-test dataset. This difference raises questions about the model’s consistency across different subsets of the dataset, and further explanation from the authors would help readers interpret this variation. - In Figure 2, the "Overall" setting is presented alongside "With Informal" and "Without Informal" sett
1. **Independent sub-goal creation:** While sub-goal creation for theorem proving is not a new idea, the authors try to make the sub-goal generation independent of the original theorem we are trying to prove. I believe that they could do this because of the choice of ITP (i.e. Isabelle) because proofs in Isabelle are written by adding hypothesis to the existing set of hypotheses, util one of the hypotheses matches the goal. In Lean, proofs are generally not written like this, we try to simplify
**1. Lack of Novelty:** This specific idea already exists in other forms for other ITPs. The LeanStar paper (https://arxiv.org/abs/2407.10040) which the authors cite, follows a very similar approach for Lean. The only difference is that instead of creating a sub-goal, CoT is created to explain the intermediate steps while writing formal Lean proofs. Just like Subgoal XL they also end up doing an expert iteration training. Other than the use of Isabelle instead of Lean, and sub-goal formulation
- The paper constructs an (aligned) informal-formal dataset by combining existing informal or formal datasets, which may be valuable to the community if made publicly available. - SubgoalXL demonstrates strong performance on the miniF2F dataset, validating the effectiveness of its subgoal-based proofs and expert iteration approach.
- The paper is poorly written in some sections, making it hard to follow. First, I find the two key terms, "subgoal-based proofs" and "expert learning", somewhat confusing. Based on my understanding from the paper (e.g., Figures 1 and 6), "subgoal-based proofs" seem to refer to step-by-step informal proofs. However, in the abstract and line 84, the authors mention "leveraging the Isabelle environment’s advantages in subgoal-based proofs." This raises confusion because Isabelle handles formal pr
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsAI-based Problem Solving and Planning · Intelligent Tutoring Systems and Adaptive Learning · Software Engineering Research
