Lean-STaR: Learning to Interleave Thinking and Proving

Haohan Lin; Zhiqing Sun; Sean Welleck; Yiming Yang

arXiv:2407.10040·cs.AI·March 18, 2025·1 cites

Lean-STaR: Learning to Interleave Thinking and Proving

Haohan Lin, Zhiqing Sun, Sean Welleck, Yiming Yang

PDF

Open Access 3 Reviews

TL;DR

Lean-STaR introduces a novel approach where language models generate informal thoughts before proof steps, significantly improving theorem-proving performance by leveraging additional reasoning context.

Contribution

The paper presents Lean-STaR, a framework that trains models to produce informal thoughts prior to proof steps, enhancing proof accuracy and state-of-the-art results on miniF2F.

Findings

01

Achieves 46.3% Pass@64 on miniF2F benchmark

02

Outperforms baseline models significantly

03

Provides insights into the role of informal thoughts in proof success

Abstract

Traditional language model-based theorem proving assumes that by training on a sufficient amount of formal proof data, a model will learn to prove theorems. Our key observation is that a wealth of informal information that is not present in formal proofs can be useful for learning to prove theorems. For instance, humans think through steps of a proof, but this thought process is not visible in the resulting code. We present Lean-STaR, a framework for training language models to produce informal thoughts prior to each step of a proof, thereby boosting the model's theorem-proving capabilities. Lean-STaR uses retrospective ground-truth tactics to generate synthetic thoughts for training the language model. At inference time, the trained model directly generates the thoughts prior to the prediction of the tactics in each proof step. Building on the self-taught reasoner framework, we then…

Peer Reviews

Decision·ICLR 2025 Spotlight

Reviewer 01Rating 8Confidence 4

Strengths

- The problem the paper aims to address is long-standing, and the paper provides strong empirical results demonstrating that the Chain-of-Thought (CoT) reasoning before each tactic application is useful. - The newly proposed sampling method is novel and, based on its performance, appears to be a promising approach compared to Best-First Search. - The proposed CoT-augmented dataset should be useful for future research in neural theorem proving. - The paper is well-written and easy to follow.

Weaknesses

- One issue with the generated COT is the absence of a verification mechanism or alternative methods to ensure its quality. The paper does not provide a detailed evaluation of these generated COTs; it only presents the final pass rate information. I believe that having a robust evaluation is crucial for improving the performance of this method. Moreover, in prompting GPT-4 to generate COTs, the authors have only explored one approach in constructing the prompt. I hope for a more in-depth analysi

Reviewer 02Rating 6Confidence 4

Strengths

- The idea for constructing and training a tactic predictor p(tactic | proof context) = Expectation_{thought ~ p(thought | proof context)} p(tactic | thought, proof context) by introducing a latent variable in the form of thoughts is novel. - There are numerous experiments.

Weaknesses

The presentation of the paper could be improved and I have one major concern about soundness of the results (see questions for the authors). - In general, I found the abbreviations and terminology of the paper to be confusing. For example, the text refers to the dataset as a thought-augmented dataset but it is abbreviated as CoT dataset. A model for direct tactic prediction is abbreviated as a SFT model. Lean-STaR is both the name of the framework and a model that is produced by the framework.

Reviewer 03Rating 8Confidence 4

Strengths

**1. Simple yet very effective approach:** The authors use retrospective rationale generation to come up with the thoughts given the state and ground truth action. The idea to train another model Lean-CoT for generation of thought-augmented tactics, is simple but works very well in this setting. **2. Use of expert iteration while informal thought generation:** While expert iteration has been used in previous works to improve formal theorem proving, the authors demonstrate that the same holds tr

Weaknesses

**1. Data Leakage:** While the authors talk about the limitations of their approach, they don't discuss potential data leakage due to the use of GPT-4 for generating thought annotations. It can very well happen that miniF2F-related proof-steps are leaked in the thought generation process. Also, miniF2F has been used for a while for evaluating approaches for formal theorem proving, there is a chance that the base pre-trained model itself is trained on miniF2F repository. The worst part is that pr

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsQuality and Supply Management

MethodsBalanced Selection