Automated Theorem Provers Help Improve Large Language Model Reasoning

Lachlan McGinness; Peter Baumgartner

arXiv:2408.03492·cs.AI·August 8, 2024

TL;DR

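This paper shows that using a large language model only to translate a problem into formal logic, and handing the actual solving to an automated theorem prover, improves logical reasoning accuracy over direct LLM answers. Because the result depends on the translation being correct, the paper also categorizes and corrects translation errors.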

Contribution

The paper introduces a novel framework combining LLMs with first-order logic ATPs for automatic error correction in logical reasoning tasks.

Findings

01. Semantic error correction reduces errors significantly.

02. Accuracy of LLM reasoning improves with ATP integration.

03. Framework effectively identifies and corrects translation errors.

Abstract

In this paper we demonstrate how logic programming systems and Automated first-order logic Theorem Provers (ATPs) can improve the accuracy of Large Language Models (LLMs) for logical reasoning tasks where the baseline performance is given by direct LLM solutions. We first evaluate LLM reasoning on steamroller problems using the PRONTOQA benchmark. We show how accuracy can be improved with a neuro-symbolic architecture where the LLM acts solely as a front-end for translating a given problem into a formal logic language and an automated reasoning engine is called for solving it. However, this approach critically hinges on the correctness of the LLM translation. To assess this translation correctness, we secondly define a framework of syntactic and semantic error categories. We implemented the framework and used it to identify errors that LLMs make in the benchmark domain. Based on these…
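
The architecture described in the abstract (LLM as translator, reasoning engine as solver) can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: `translate_stub` is a hypothetical, hard-coded stand-in for the LLM translation step, the `Rule` and `Fact` types are invented for this example, and a tiny forward-chaining loop stands in for a real logic programming system or ATP. The example problem is written in the style of PRONTOQA but is made up here.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    """Represents 'for all x: body(x) -> head(x)' (hypothetical format)."""
    body: str
    head: str


class Fact(NamedTuple):
    """Represents a ground atom pred(const) (hypothetical format)."""
    pred: str
    const: str


def translate_stub(problem: str):
    """Hypothetical stand-in for the LLM front-end.

    A real system would prompt an LLM to produce the formal encoding.
    Here the encoding of the example problem is hard-coded, so the
    argument is ignored.
    """
    rules = [Rule("cat", "mammal"), Rule("mammal", "animal")]
    facts = [Fact("cat", "fae")]
    query = Fact("animal", "fae")
    return rules, facts, query


def forward_chain(rules, facts):
    """Apply the rules repeatedly until no new facts can be derived."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for rule in rules:
            for fact in list(derived):
                if fact.pred == rule.body:
                    new_fact = Fact(rule.head, fact.const)
                    if new_fact not in derived:
                        derived.add(new_fact)
                        changed = True
    return derived


def solve(problem: str) -> bool:
    """Translate the problem, then let the symbolic engine answer the query."""
    rules, facts, query = translate_stub(problem)
    return query in forward_chain(rules, facts)


problem = (
    "Every cat is a mammal. Every mammal is an animal. Fae is a cat. "
    "True or false: Fae is an animal."
)
print(solve(problem))
```

The point of the split is that the symbolic step is deterministic: if the answer is wrong, the fault lies in the translation, which is why the abstract stresses that the approach hinges on translation correctness.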
