Softmax Transformers are Turing-Complete

Hongjian Jiang; Michael Hahn; Georg Zetzsche; Anthony Widjaja Lin

arXiv:2511.20038·cs.FL·November 26, 2025

Open Access 3 Reviews

TL;DR

This paper proves that length-generalizable softmax-attention Chain-of-Thought (CoT) transformers are Turing-complete, settling a question previously answered only for hard attention.

Contribution

It establishes Turing-completeness of length-generalizable softmax CoT transformers: with causal masking alone for unary (more generally, letter-bounded) languages, and with relative positional encoding for arbitrary languages.

Findings

01. Turing-completeness for CoT C-RASP with unary alphabet

02. Turing-completeness extension with relative positional encoding

03. Empirical validation on complex arithmetic reasoning tasks

Abstract

Hard attention Chain-of-Thought (CoT) transformers are known to be Turing-complete. However, it is an open problem whether softmax attention CoT transformers are Turing-complete. In this paper, we prove a stronger result that length-generalizable softmax CoT transformers are Turing-complete. More precisely, our Turing-completeness proof goes via the CoT extension of the Counting RASP (C-RASP), which corresponds to softmax CoT transformers that admit length generalization. We prove Turing-completeness for CoT C-RASP with causal masking over a unary alphabet (more generally, for letter-bounded languages). While we show this is not Turing-complete for arbitrary languages, we prove that its extension with relative positional encoding is Turing-complete for arbitrary languages. We empirically validate our theory by training transformers for languages requiring complex…
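
As a rough illustration of why counting is natural for softmax attention with causal masking, the sketch below uses uniform attention (all scores equal) to average an indicator over each prefix, then rescales by the prefix length to recover a count. This is not the paper's construction; the function names `causal_uniform_attention` and `prefix_counts` are hypothetical and chosen only for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_uniform_attention(values):
    """At each position i, attend to positions 0..i with equal scores.

    Equal scores give softmax weights 1/(i+1), so the output at i is
    the average of values[0..i].
    """
    out = []
    for i in range(len(values)):
        weights = softmax([0.0] * (i + 1))
        out.append(sum(w * v for w, v in zip(weights, values[: i + 1])))
    return out


def prefix_counts(word, symbol):
    """Recover, for each prefix, how many times `symbol` occurs.

    Assumption for illustration: the prefix length is available, so the
    average can be rescaled into an integer count.
    """
    indicator = [1.0 if ch == symbol else 0.0 for ch in word]
    averages = causal_uniform_attention(indicator)
    return [round(avg * (i + 1)) for i, avg in enumerate(averages)]


if __name__ == "__main__":
    word = "aabab"
    a_counts = prefix_counts(word, "a")
    b_counts = prefix_counts(word, "b")
    # Compare the two counts at every prefix, as a counting-style test would.
    print([a > b for a, b in zip(a_counts, b_counts)])
```

Comparing the two averages directly, without rescaling, would give the same ordering, since both are divided by the same prefix length.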

Peer Reviews

Decision·ICLR 2026 Oral

Reviewer 01 · Rating 10 · Confidence 2

Strengths

- This paper takes a step towards proving the Turing-completeness of transformers under more realistic architecture definitions. Softmax (rather than hardmax) is differentiable, so this proof is the first one for a transformer architecture definition that is trainable via gradient methods.
- The proof of Turing completeness goes through different techniques than previous proofs for hardmax attention. I am not familiar enough with these techniques to know if they are standard in formal language t

Weaknesses

- Through Section 2, many proofs of the results, definitions, etc. seem to be taken nearly directly from Huang et al. (2025). This makes it hard to read without first reading Huang et al. There is significant setup assumed from Huang et al., some of which is never described in the paper itself. Generally, it seems worth defining notation before it is used, even if the notation is standard in formal language theory. Notation was not defined for the definition of C-RASP, CoT C-RASP and in many other pl

Reviewer 02 · Rating 4 · Confidence 2

Strengths

The paper answers positively the relevant question of whether softmax-attention Chain-of-Thought (CoT) transformers are Turing-complete, thereby extending previous work that only showed Turing-completeness using hardmax attention. Furthermore, it is insightful to see that CoT C-RASPs with causal masking are not generally Turing-complete, but that they do become Turing-complete when Relative Positional Encodings are added.

Weaknesses

I think the paper could profit from additional clarity in the writing and in the explanations. I was for example slightly confused by phrases like 'to provide _a kind of_ guarantee of trainability' (in the contributions). Furthermore, I found the section on the 'Empirical Experiments' not very clearly written and would have been interested in seeing slightly more details and explanations on the tasks, the architectures used and the training. The corresponding Appendix D was very short and did no

Reviewer 03 · Rating 2 · Confidence 3

Strengths

The paper considers an important theoretical question, i.e., whether Turing-completeness holds for more realistic models such as softmax Transformers.

Weaknesses

1. The expression "Turing-completeness for some languages" is conceptually unclear. Turing-completeness has a strict formal definition. The results presented in the paper do not appear to fully establish the claim.
2. Softmax Transformers should, in principle, approximate hard-attention arbitrarily well. Since hard-attention Transformers are known to be Turing-complete, the paper’s results suggesting a gap between softmax Transformers and Turing-completeness raise questions. The paper seems to

Taxonomy

Topics: Advanced Memory and Neural Computing · Ferroelectric and Negative Capacitance Devices · Computability, Logic, AI Algorithms