The Illusion of Procedural Reasoning: Measuring Long-Horizon FSM Execution in LLMs
Mahdi Samiei, Mahdi Mansouri, Mahdieh Soleymani Baghshah

TL;DR
This paper introduces a benchmark using finite-state machines to evaluate and diagnose the procedural reasoning capabilities of large language models, revealing systematic degradation with increased complexity and highlighting areas for improvement.
Contribution
We propose FSM execution as an interpretable, controlled benchmark for assessing LLMs' long-horizon procedural reasoning, providing insights into their internal fidelity and failure modes.
Findings
Models degrade with increased task horizon and complexity.
Larger models improve local accuracy but remain brittle in multi-step reasoning.
Explicit prompting to externalize steps enhances model robustness.
Abstract
Large language models (LLMs) have achieved remarkable results on tasks framed as reasoning problems, yet their true ability to perform procedural reasoning, that is, to execute multi-step, rule-based computations, remains unclear. Unlike algorithmic systems, which can deterministically execute long-horizon symbolic procedures, LLMs often degrade under extended reasoning chains, but there is no controlled, interpretable benchmark to isolate and measure this collapse. We introduce Finite-State Machine (FSM) Execution as a minimal, fully interpretable framework for evaluating the procedural reasoning capacity of LLMs. In our setup, the model is given an explicit FSM definition and must execute it step by step on a sequence of input actions, maintaining state consistency over multiple turns. This task requires no world knowledge, only faithful application of deterministic transition rules, making it a direct…
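To make the setup described in the abstract concrete, the following is a minimal sketch of deterministic FSM execution and one possible way to compare a model's reported state trace against the ground truth. The toy machine, the names `run_fsm` and `score_trace`, and the scoring rule are illustrative assumptions, not the authors' benchmark code or metrics.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical representation: (current_state, action) -> next_state.
Transitions = Dict[Tuple[str, str], str]


def run_fsm(transitions: Transitions, start: str, actions: List[str]) -> List[str]:
    """Apply the transition table to each action and return the visited states."""
    trace = []
    state = start
    for action in actions:
        state = transitions[(state, action)]  # deterministic lookup, no world knowledge
        trace.append(state)
    return trace


def score_trace(reference: List[str], predicted: List[str]) -> Tuple[float, Optional[int]]:
    """Return (per-step accuracy, index of the first wrong step or None).

    Assumed scoring rule: steps missing from `predicted` count as wrong.
    """
    matches = [r == p for r, p in zip(reference, predicted)]
    accuracy = sum(matches) / len(reference) if reference else 1.0
    first_error = next((i for i, ok in enumerate(matches) if not ok), None)
    if first_error is None and len(predicted) < len(reference):
        first_error = len(predicted)
    return accuracy, first_error


# Toy three-state machine, invented for illustration only.
TOY_FSM: Transitions = {
    ("A", "x"): "B", ("A", "y"): "A",
    ("B", "x"): "C", ("B", "y"): "A",
    ("C", "x"): "C", ("C", "y"): "B",
}

reference = run_fsm(TOY_FSM, "A", ["x", "x", "y", "y", "x"])
# A made-up model output that loses track of the state at the fourth step.
predicted = ["B", "C", "B", "B", "C"]
accuracy, first_error = score_trace(reference, predicted)
print(reference, accuracy, first_error)
```

In this sketch a single wrong state at step index 3 also makes the following step wrong, which illustrates why errors in long-horizon execution tend to compound rather than stay local.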
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
