Understanding LLM Failures: A Multi-Tape Turing Machine Analysis of Systematic Errors in Language Model Reasoning
Magnus Boman

TL;DR
This paper introduces a formal multi-tape Turing machine model to analyze systematic errors in large language models, clarifying failure modes and the impact of techniques like chain-of-thought prompting.
Contribution
It provides a novel formal framework for localizing LLM failures to specific pipeline stages, offering insights beyond empirical observations.
Findings
Tokenization can obscure character-level information needed for counting tasks.
Chain-of-thought prompting externalizes computation but has fundamental limitations.
The model offers a rigorous, falsifiable approach to understanding LLM errors.
Abstract
Large language models (LLMs) exhibit failure modes on seemingly trivial tasks. We propose a formalisation of LLM interaction using a deterministic multi-tape Turing machine, where each tape represents a distinct component: input characters, tokens, vocabulary, model parameters, activations, probability distributions, and output text. The model enables precise localisation of failure modes to specific pipeline stages, revealing, e.g., how tokenisation obscures character-level structure needed for counting tasks. The model clarifies why techniques like chain-of-thought prompting help, by externalising computation on the output tape, while also revealing their fundamental limitations. This approach provides a rigorous, falsifiable alternative to geometric metaphors and complements empirical scaling laws with principled error analysis.
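The tokenisation point can be illustrated with a minimal sketch. It is not the paper's formal construction: the toy vocabulary, the greedy longest-match tokeniser, and the names `VOCAB`, `tokenize`, `count_char_via_vocab`, and `spell_out` are all assumptions made for illustration. The token tape holds only integer IDs, so a character count is recoverable only by consulting the vocabulary tape, or by first spelling the characters out onto the output tape, which is roughly the externalisation that chain-of-thought prompting is described as providing.

```python
# Hypothetical toy vocabulary tape; real LLM vocabularies are learned from data.
VOCAB = ["straw", "berry", "s", "t", "r", "a", "w", "b", "e", "y"]
TOKEN_ID = {piece: i for i, piece in enumerate(VOCAB)}


def tokenize(text):
    """Greedy longest-match segmentation of the input tape into token IDs."""
    ids = []
    pos = 0
    while pos < len(text):
        # Pick the longest vocabulary entry that matches at this position.
        match = max(
            (piece for piece in VOCAB if text.startswith(piece, pos)),
            key=len,
            default=None,
        )
        if match is None:
            raise ValueError(f"no vocabulary entry matches at position {pos}")
        ids.append(TOKEN_ID[match])
        pos += len(match)
    return ids


def count_char_via_vocab(ids, ch):
    """Count a character by looking each token ID up on the vocabulary tape."""
    return sum(VOCAB[i].count(ch) for i in ids)


def spell_out(ids):
    """Write the characters explicitly to an output tape, one symbol per cell."""
    return [ch for i in ids for ch in VOCAB[i]]


token_tape = tokenize("strawberry")
print(token_tape)                             # [0, 1]
print(count_char_via_vocab(token_tape, "r"))  # 3
print(spell_out(token_tape).count("r"))       # 3
```

The token tape `[0, 1]` contains no occurrence of the letter "r" at all; the count of 3 only appears once the IDs are mapped back to character strings.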
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Explainable Artificial Intelligence (XAI)
