Large Language Models Meet Symbolic Provers for Logical Reasoning Evaluation

Chengwen Qi; Ren Ma; Bowen Li; He Du; Binyuan Hui; Jinwang Wu; Yuanjun Laili; Conghui He

arXiv:2502.06563 · cs.CL · March 4, 2025

TL;DR

This paper introduces ProverGen, a framework combining Large Language Models and symbolic provers to generate a challenging, diverse, and high-quality dataset for first-order logic reasoning, improving evaluation and training of reasoning models.

Contribution

ProverGen is a novel framework that creates scalable, diverse, and high-quality FOL reasoning datasets by integrating LLMs with symbolic provers, and demonstrates its effectiveness through a new dataset and improved model performance.

Findings

01. State-of-the-art LLMs struggle with ProverQA problems.
02. Fine-tuned Llama3.1-8B-Instruct shows improved reasoning performance.
03. ProverQA includes coherent intermediate reasoning steps.

Abstract

First-order logic (FOL) reasoning, which involves sequential deduction, is pivotal for intelligent systems and serves as a valuable task for evaluating reasoning capabilities, particularly in chain-of-thought (CoT) contexts. Existing benchmarks often rely on extensive human annotation or handcrafted templates, making it difficult to achieve the necessary complexity, scalability, and diversity for robust evaluation. To address these limitations, we propose a novel framework called ProverGen that synergizes the generative strengths of Large Language Models (LLMs) with the rigor and precision of symbolic provers, enabling the creation of a scalable, diverse, and high-quality FOL reasoning dataset, ProverQA. ProverQA is also distinguished by its inclusion of accessible and logically coherent intermediate reasoning steps for each problem. Our evaluation shows that state-of-the-art LLMs…
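To make "sequential deduction" with visible intermediate steps concrete, here is a minimal sketch in Python. It is not ProverGen's pipeline and does not call a real symbolic prover: it assumes ground (variable-free) if-then rules, a toy fact base, and the hypothetical answer labels "True" and "Uncertain". It only illustrates how chaining rules one at a time yields both an answer and a readable trace of the steps that led to it.

```python
from typing import FrozenSet, List, Set, Tuple

# A rule is (premises, conclusion) over ground atoms written as plain strings.
# This is an illustrative simplification, not full first-order logic.
Rule = Tuple[FrozenSet[str], str]


def forward_chain(facts: Set[str], rules: List[Rule]) -> Tuple[Set[str], List[str]]:
    """Apply rules until nothing new can be derived; record each derivation step."""
    known = set(facts)
    steps: List[str] = []
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            # Fire a rule only if all premises are known and it adds something new.
            if premises <= known and conclusion not in known:
                known.add(conclusion)
                steps.append(f"{' & '.join(sorted(premises))} -> {conclusion}")
                changed = True
    return known, steps


def answer(goal: str, facts: Set[str], rules: List[Rule]) -> Tuple[str, List[str]]:
    """Label the goal 'True' if derivable, else 'Uncertain' (hypothetical labels)."""
    known, steps = forward_chain(facts, rules)
    return ("True" if goal in known else "Uncertain"), steps


# Toy problem (invented for illustration).
FACTS = {"Bird(tweety)"}
RULES: List[Rule] = [
    (frozenset({"Bird(tweety)"}), "HasWings(tweety)"),
    (frozenset({"HasWings(tweety)"}), "CanFly(tweety)"),
]

label, steps = answer("CanFly(tweety)", FACTS, RULES)
print(label)
for step in steps:
    print(step)
```

The recorded `steps` list plays the role of an intermediate reasoning chain: each entry names the premises used and the conclusion reached, so a reader (or an evaluator) can check the deduction one step at a time rather than only the final label.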

Code & Models

Repositories

opendatalab/provergen (Official)

Taxonomy

Topics: Semantic Web and Ontologies

Methods: Sparse Evolutionary Training