Steamroller Problems: An Evaluation of LLM Reasoning Capability with Automated Theorem Prover Strategies
Lachlan McGinness, Peter Baumgartner

TL;DR
This paper evaluates the reasoning capabilities of large language models like GPT-4 and Gemini in automated theorem proving, revealing low correlation between reasoning correctness and answer accuracy, and emphasizing the importance of reasoning strategies and uncertainty analysis.
Contribution
It is the first study to systematically assess LLMs' ability to follow ATP reasoning strategies and explore new NLP methods for analyzing their reasoning processes.
Findings
Low correlation between correct reasoning and correct answers.
Models perform similarly to one-shot chain of thought.
Reasoning strategies help in deriving relevant formulas for external inference.
Abstract
This study presents the first examination of the ability of Large Language Models (LLMs) to follow reasoning strategies that are used to guide Automated Theorem Provers (ATPs). We evaluate the performance of GPT4, GPT3.5 Turbo and Google's recent Gemini model on problems from a steamroller domain. In addition to determining accuracy we make use of the Natural Language Processing library spaCy to explore new methods of investigating LLM's reasoning capabilities. This led to one alarming result, the low correlation between correct reasoning and correct answers for any of the tested models. We found that the models' performance when using the ATP reasoning strategies was comparable to one-shot chain of thought and observe that attention to uncertainty in the accuracy results is critical when drawing conclusions about model performance. Consistent with previous speculation we confirm that…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsNatural Language Processing Techniques · Modeling, Simulation, and Optimization
MethodsSoftmax · Attention Is All You Need · Lib
