Lyra: Orchestrating Dual Correction in Automated Theorem Proving
Chuanyang Zheng, Haiming Wang, Enze Xie, Zhengying Liu, Jiankai Sun,, Huajian Xin, Jianhao Shen, Zhenguo Li, Yu Li

TL;DR
Lyra introduces dual correction mechanisms for LLM-based formal theorem proving, significantly reducing hallucinations and refining conjectures, leading to state-of-the-art performance on miniF2F and solving challenging IMO problems.
Contribution
The paper presents Lyra, a novel framework employing Tool Correction and Conjecture Correction to improve the accuracy and reliability of LLMs in formal theorem proving.
Findings
Achieved state-of-the-art results on miniF2F validation and test sets.
Significantly improved proof accuracy by mitigating hallucinations.
Successfully solved 3 challenging IMO problems using Lyra.
Abstract
Large Language Models (LLMs) present an intriguing avenue for exploration in the field of formal theorem proving. Nevertheless, their full potential, particularly concerning the mitigation of hallucinations and refinement through prover error messages, remains an area that has yet to be thoroughly investigated. To enhance the effectiveness of LLMs in the field, we introduce the Lyra, a new framework that employs two distinct correction mechanisms: Tool Correction (TC) and Conjecture Correction (CC). To implement Tool Correction in the post-processing of formal proofs, we leverage prior knowledge to utilize predefined prover tools (e.g., Sledgehammer) for guiding the replacement of incorrect tools. Tool Correction significantly contributes to mitigating hallucinations, thereby improving the overall accuracy of the proof. In addition, we introduce Conjecture Correction, an error feedback…
Peer Reviews
Decision·Submitted to ICLR 2024
Both Tool Correction and Conjecture Correction are technically sound. Tool Correction implies the insights that LLM is better at generating the next conjecture to prove than closing the conjecture. Extensive search is helpful to close the conjecture.
The paper doesn't have much novelty in terms of the approach. It seems that the set of 11 tactics in TC have been proposed in DSP. The method of appending error message for self-debugging and generating multiple candidates have been commonly used for code generation.
The paper is well written with a clear illustration of its two contributions TC and CC. I especially appreciate the ablation study where Lyra downgrades to DSP without TC and CC. The performance gain also looks good.
Although I very much like the idea of CC, the innovation of TC appears slightly limited. At least to me, it does not involve much interaction with LLMs -- it is more like an exhaustive attempt on a set of heuristically chosen proof methods other than Sledgehammer.
1. The techniques lead to an improvement over the state-of-the-art. 2. I find it a little surprising that models like GPT-4 and Codex are able to understand and act based on the error messages from Isabelle. Unfortunately, the paper does not explore this surprising observation in depth. ----------------------------- Rating updated to 5 after discussion. Overall, I remain unconvinced by the technical contribution of TC but I find the empirical phenomenon of LLMs being able to interpret ITP error
1. I am not an expert in the area so it is possible that the empirical results might be surprising to experts in a way that I am unable to appreciate. However, I find that the techniques used to achieve the state-of-the-art results do not involve any significant technical or empirical insights. 2. The heuristics used in TC seem too specific to the task and dataset. Are the heuristics used for TC transferable to other theorem proving tools and datasets? How do we know that these heuristics are n
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsNatural Language Processing Techniques · Topic Modeling · Mathematics, Computing, and Information Processing
