Lyra: Orchestrating Dual Correction in Automated Theorem Proving

Chuanyang Zheng; Haiming Wang; Enze Xie; Zhengying Liu; Jiankai Sun,; Huajian Xin; Jianhao Shen; Zhenguo Li; Yu Li

arXiv:2309.15806·cs.CL·August 27, 2024·2 cites

Lyra: Orchestrating Dual Correction in Automated Theorem Proving

Chuanyang Zheng, Haiming Wang, Enze Xie, Zhengying Liu, Jiankai Sun,, Huajian Xin, Jianhao Shen, Zhenguo Li, Yu Li

PDF

Open Access 1 Repo 3 Reviews

TL;DR

Lyra introduces dual correction mechanisms for LLM-based formal theorem proving, significantly reducing hallucinations and refining conjectures, leading to state-of-the-art performance on miniF2F and solving challenging IMO problems.

Contribution

The paper presents Lyra, a novel framework employing Tool Correction and Conjecture Correction to improve the accuracy and reliability of LLMs in formal theorem proving.

Findings

01

Achieved state-of-the-art results on miniF2F validation and test sets.

02

Significantly improved proof accuracy by mitigating hallucinations.

03

Successfully solved 3 challenging IMO problems using Lyra.

Abstract

Large Language Models (LLMs) present an intriguing avenue for exploration in the field of formal theorem proving. Nevertheless, their full potential, particularly concerning the mitigation of hallucinations and refinement through prover error messages, remains an area that has yet to be thoroughly investigated. To enhance the effectiveness of LLMs in the field, we introduce the Lyra, a new framework that employs two distinct correction mechanisms: Tool Correction (TC) and Conjecture Correction (CC). To implement Tool Correction in the post-processing of formal proofs, we leverage prior knowledge to utilize predefined prover tools (e.g., Sledgehammer) for guiding the replacement of incorrect tools. Tool Correction significantly contributes to mitigating hallucinations, thereby improving the overall accuracy of the proof. In addition, we introduce Conjecture Correction, an error feedback…

Peer Reviews

Decision·Submitted to ICLR 2024

Reviewer 01Rating 6· marginally above the acceptance thresholdConfidence 5

Strengths

Both Tool Correction and Conjecture Correction are technically sound. Tool Correction implies the insights that LLM is better at generating the next conjecture to prove than closing the conjecture. Extensive search is helpful to close the conjecture.

Weaknesses

The paper doesn't have much novelty in terms of the approach. It seems that the set of 11 tactics in TC have been proposed in DSP. The method of appending error message for self-debugging and generating multiple candidates have been commonly used for code generation.

Reviewer 02Rating 6· marginally above the acceptance thresholdConfidence 4

Strengths

The paper is well written with a clear illustration of its two contributions TC and CC. I especially appreciate the ablation study where Lyra downgrades to DSP without TC and CC. The performance gain also looks good.

Weaknesses

Although I very much like the idea of CC, the innovation of TC appears slightly limited. At least to me, it does not involve much interaction with LLMs -- it is more like an exhaustive attempt on a set of heuristically chosen proof methods other than Sledgehammer.

Reviewer 03Rating 6· marginally above the acceptance thresholdConfidence 3

Strengths

1. The techniques lead to an improvement over the state-of-the-art. 2. I find it a little surprising that models like GPT-4 and Codex are able to understand and act based on the error messages from Isabelle. Unfortunately, the paper does not explore this surprising observation in depth. ----------------------------- Rating updated to 5 after discussion. Overall, I remain unconvinced by the technical contribution of TC but I find the empirical phenomenon of LLMs being able to interpret ITP error

Weaknesses

1. I am not an expert in the area so it is possible that the empirical results might be surprising to experts in a way that I am unable to appreciate. However, I find that the techniques used to achieve the state-of-the-art results do not involve any significant technical or empirical insights. 2. The heuristics used in TC seem too specific to the task and dataset. Are the heuristics used for TC transferable to other theorem proving tools and datasets? How do we know that these heuristics are n

Code & Models

Repositories

chuanyang-zheng/lyra-theorem-prover
noneOfficial

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsNatural Language Processing Techniques · Topic Modeling · Mathematics, Computing, and Information Processing