SubgoalXL: Subgoal-based Expert Learning for Theorem Proving

Xueliang Zhao; Lin Zheng; Haige Bo; Changran Hu; Urmish Thakker,; Lingpeng Kong

arXiv:2408.11172·cs.LG·August 22, 2024

SubgoalXL: Subgoal-based Expert Learning for Theorem Proving

Xueliang Zhao, Lin Zheng, Haige Bo, Changran Hu, Urmish Thakker,, Lingpeng Kong

PDF

Open Access 1 Repo 4 Models 4 Reviews

TL;DR

SubgoalXL introduces a subgoal-based expert learning framework that significantly improves the reasoning and proof capabilities of large language models in formal theorem proving, achieving state-of-the-art results on the miniF2F dataset.

Contribution

It presents a novel subgoal-oriented proof strategy combined with expert learning to enhance data efficiency and reasoning in LLMs for formal theorem proving.

Findings

01

Achieves 56.1% accuracy on miniF2F in Isabelle

02

Solves 41 AMC12, 9 AIME, and 3 IMO problems from miniF2F

03

Outperforms previous methods with a 4.9% absolute improvement

Abstract

Formal theorem proving, a field at the intersection of mathematics and computer science, has seen renewed interest with advancements in large language models (LLMs). This paper introduces SubgoalXL, a novel approach that synergizes subgoal-based proofs with expert learning to enhance LLMs' capabilities in formal theorem proving within the Isabelle environment. SubgoalXL addresses two critical challenges: the scarcity of specialized mathematics and theorem-proving data, and the need for improved multi-step reasoning abilities in LLMs. By optimizing data efficiency and employing subgoal-level supervision, SubgoalXL extracts richer information from limited human-generated proofs. The framework integrates subgoal-oriented proof strategies with an expert learning system, iteratively refining formal statement, proof, and subgoal generators. Leveraging the Isabelle environment's advantages in…

Peer Reviews

Decision·ICLR 2025 Conference Withdrawn Submission

Reviewer 01Rating 3Confidence 3

Strengths

The paper tackles a challenging and interesting problem of ML-guided formal theorem proving. It presents a new idea of applying an intermediate step of generating informal "subgoal proofs" before generating the formal proof. The evaluation on miniF2F shows substantial improvement compared to the other approaches. The authors release the code of SubgoalXL (although I didn't test it).

Weaknesses

The author do not analyze the inference-time compute in the evaluation. For instance, you write "[a]t each iteration, we trained 4 formal proof generators". Are they all used when evaluating on miniF2F? This makes the inference cost potentially much higher than in the other approaches. It is not described how you evaluate on miniF2F given the trained three components (see questions below). The authors did not disclose the datasets for finetuning the components of SubgoalXL, which makes imposs

Reviewer 02Rating 6Confidence 3

Strengths

+ This paper introduces a novel method to address multi-step reasoning in interactive theorem proving by utilizing LLMs to generate subgoals. This approach is particularly valuable for scaling ITP by automating the decomposition of formal reasoning. + By generating informal-formal pairs of statements and proofs for subgoals, this work contributes to a richer dataset for formal theorem proving. This has implications beyond this paper, as it could aid the broader community in training more robust

Weaknesses

- Some evaluation results require additional clarification. In Table 2, the SubgoalXL model outperforms the -subgoal model by 11.5% on the miniF2F-valid dataset but only by 2.8% on the miniF2F-test dataset. This difference raises questions about the model’s consistency across different subsets of the dataset, and further explanation from the authors would help readers interpret this variation. - In Figure 2, the "Overall" setting is presented alongside "With Informal" and "Without Informal" sett

Reviewer 03Rating 3Confidence 4

Strengths

1. **Independent sub-goal creation:** While sub-goal creation for theorem proving is not a new idea, the authors try to make the sub-goal generation independent of the original theorem we are trying to prove. I believe that they could do this because of the choice of ITP (i.e. Isabelle) because proofs in Isabelle are written by adding hypothesis to the existing set of hypotheses, util one of the hypotheses matches the goal. In Lean, proofs are generally not written like this, we try to simplify

Weaknesses

**1. Lack of Novelty:** This specific idea already exists in other forms for other ITPs. The LeanStar paper (https://arxiv.org/abs/2407.10040) which the authors cite, follows a very similar approach for Lean. The only difference is that instead of creating a sub-goal, CoT is created to explain the intermediate steps while writing formal Lean proofs. Just like Subgoal XL they also end up doing an expert iteration training. Other than the use of Isabelle instead of Lean, and sub-goal formulation

Reviewer 04Rating 3Confidence 5

Strengths

- The paper constructs an (aligned) informal-formal dataset by combining existing informal or formal datasets, which may be valuable to the community if made publicly available. - SubgoalXL demonstrates strong performance on the miniF2F dataset, validating the effectiveness of its subgoal-based proofs and expert iteration approach.

Weaknesses

- The paper is poorly written in some sections, making it hard to follow. First, I find the two key terms, "subgoal-based proofs" and "expert learning", somewhat confusing. Based on my understanding from the paper (e.g., Figures 1 and 6), "subgoal-based proofs" seem to refer to step-by-step informal proofs. However, in the abstract and line 84, the authors mention "leveraging the Isabelle environment’s advantages in subgoal-based proofs." This raises confusion because Isabelle handles formal pr

Code & Models

Repositories

zhaoxlpku/subgoalxl
pytorchOfficial

Models

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsAI-based Problem Solving and Planning · Intelligent Tutoring Systems and Adaptive Learning · Software Engineering Research