Large Language Models Meet Symbolic Provers for Logical Reasoning Evaluation
Chengwen Qi, Ren Ma, Bowen Li, He Du, Binyuan Hui, Jinwang Wu, Yuanjun Laili, Conghui He

TL;DR
This paper introduces ProverGen, a framework combining Large Language Models and symbolic provers to generate a challenging, diverse, and high-quality dataset for first-order logic reasoning, improving evaluation and training of reasoning models.
Contribution
ProverGen is a novel framework that creates scalable, diverse, and high-quality FOL reasoning datasets by integrating LLMs with symbolic provers, and demonstrates its effectiveness through a new dataset and improved model performance.
Findings
State-of-the-art LLMs struggle with ProverQA problems.
Finetuning Llama3.1-8B-Instruct on ProverGen-generated data improves its reasoning performance.
Each ProverQA problem includes accessible, logically coherent intermediate reasoning steps.
Abstract
First-order logic (FOL) reasoning, which involves sequential deduction, is pivotal for intelligent systems and serves as a valuable task for evaluating reasoning capabilities, particularly in chain-of-thought (CoT) contexts. Existing benchmarks often rely on extensive human annotation or handcrafted templates, making it difficult to achieve the necessary complexity, scalability, and diversity for robust evaluation. To address these limitations, we propose a novel framework called ProverGen that synergizes the generative strengths of Large Language Models (LLMs) with the rigor and precision of symbolic provers, enabling the creation of a scalable, diverse, and high-quality FOL reasoning dataset, ProverQA. ProverQA is also distinguished by its inclusion of accessible and logically coherent intermediate reasoning steps for each problem. Our evaluation shows that state-of-the-art LLMs…
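To make "sequential deduction" with intermediate reasoning steps concrete, here is a minimal sketch of a toy forward-chaining prover in Python. It is an illustration only and is not ProverGen's implementation or the symbolic prover the authors use. The function name forward_chain, the fact strings, and the rules are all hypothetical, and the facts are treated as ground atoms rather than full first-order formulas with quantifiers.

```python
# Illustrative only: a toy forward-chaining prover over ground facts.
# This is NOT ProverGen's implementation; all names here are hypothetical.

def forward_chain(facts, rules, goal):
    """Derive new facts until `goal` is reached or nothing new can be derived.

    facts: iterable of strings assumed to be true
    rules: list of (premises, conclusion) pairs; premises is a tuple of strings
    Returns (proved, steps), where steps lists each deduction in order.
    """
    known = set(facts)
    steps = []
    changed = True
    while changed and goal not in known:
        changed = False
        for premises, conclusion in rules:
            # Fire a rule only if it adds something new and all premises hold.
            if conclusion not in known and all(p in known for p in premises):
                known.add(conclusion)
                steps.append((premises, conclusion))
                changed = True
                if goal in known:
                    break
    return goal in known, steps


facts = ["bird(tweety)", "healthy(tweety)"]
rules = [
    (("bird(tweety)", "healthy(tweety)"), "can_fly(tweety)"),
    (("can_fly(tweety)",), "can_migrate(tweety)"),
]

proved, steps = forward_chain(facts, rules, "can_migrate(tweety)")
for premises, conclusion in steps:
    print(" and ".join(premises), "=>", conclusion)
print("proved:", proved)
```

The recorded steps play the role of an intermediate reasoning chain: each entry shows which premises justified which conclusion, so a final answer can be checked step by step rather than only at the end.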
Taxonomy
Topics: Semantic Web and Ontologies
Methods: Sparse Evolutionary Training
