Large Language Models' Reasoning Stalls: An Investigation into the Capabilities of Frontier Models
Lachlan McGinness, Peter Baumgartner

TL;DR
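This paper evaluates the reasoning capabilities of frontier large language models and finds that progress stalled between December 2023 and August 2024. Almost all improvement since GPT-4 is attributed to hidden system prompts or to models trained to apply Chain of Thought prompting automatically, rather than to genuine reasoning enhancements.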
Contribution
The study evaluates state-of-the-art LLMs from December 2023 and August 2024 on PRONTOQA steamroller reasoning problems, develops methods for assessing response accuracy and correct answer correlation, and analyzes how prompting strategies affect reasoning performance.
Findings
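Progress in LLM reasoning stalled over the nine months from December 2023 to August 2024.
Almost all improvement since GPT-4 is attributed to hidden system prompts or to training models to apply generic Chain of Thought prompting automatically.
Among the ATP reasoning strategies tried, current frontier models follow bottom-up (forward-chaining) reasoning best.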
Abstract
We study empirical methods for examining the capability of Large Language Models (LLMs) to use Automated Theorem Prover (ATP) reasoning strategies. We evaluate the performance of state-of-the-art models from December 2023 and August 2024 on PRONTOQA steamroller reasoning problems. To that end, we develop methods for assessing LLM response accuracy and correct answer correlation. Our results show that progress in improving LLM reasoning abilities has stalled over the nine-month period. By tracking completion tokens, we show that almost all improvement in reasoning ability since GPT-4 was released can be attributed to either hidden system prompts or the training of models to automatically use generic Chain of Thought prompting strategies. Among the ATP reasoning strategies tried, we found that current frontier LLMs are best able to follow the bottom-up (also known as forward-chaining)…
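For readers unfamiliar with the bottom-up (forward-chaining) strategy named in the abstract, the following is a minimal sketch of the general technique, not the paper's implementation or its prompts. The function name `forward_chain`, the rule format, and the toy facts about "rex" are all hypothetical and chosen only for illustration. Starting from known facts, the loop repeatedly applies any rule whose premises are all known until nothing new can be derived.

```python
def forward_chain(facts, rules):
    """Return every fact derivable from `facts` using `rules`.

    facts: iterable of atoms (plain strings here, as an assumption).
    rules: list of (premises, conclusion) pairs, where premises is a
           tuple of atoms that must all be known before the rule fires.
    """
    known = set(facts)
    changed = True
    while changed:  # keep sweeping until a full pass adds nothing new
        changed = False
        for premises, conclusion in rules:
            if conclusion not in known and all(p in known for p in premises):
                known.add(conclusion)
                changed = True
    return known


# Hypothetical toy ontology, loosely in the spirit of ontology-style
# reasoning problems; it is not taken from the dataset.
facts = {"rex is a cat"}
rules = [
    (("rex is a mammal", "rex is a cat"), "rex is an animal"),
    (("rex is a carnivore",), "rex is a mammal"),
    (("rex is a cat",), "rex is a carnivore"),
]

derived = forward_chain(facts, rules)
print(sorted(derived))
```

The rules are deliberately listed out of dependency order so that several passes are needed. A top-down (backward-chaining) strategy would instead start from the query and search for rules whose conclusion matches it.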
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Multi-Agent Systems and Negotiation
Methods: Linear Layer · Dense Connections · Softmax · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Label Smoothing · Multi-Head Attention · Attention Is All You Need · Layer Normalization · Byte Pair Encoding
