Efficient Turing Machine Simulation with Transformers

Qian Li; Yuyi Wang

arXiv:2512.00003·cs.CC·December 3, 2025

Open Access · 3 Reviews

TL;DR

This paper shows that constant bit-size Transformers can simulate any $(t(n), s(n))$-bounded multi-tape Turing machine with an optimal $O(s(n))$-long context window and only $O(s(n)^{c})$ chain-of-thought steps per simulated step, for an arbitrarily small constant $c > 0$.

Contribution

It introduces a novel, more efficient method for simulating multi-tape Turing machines with Transformers, reducing reasoning steps and leveraging sparse attention.

Findings

1. Transformers can simulate Turing machines with an optimal $O(s(n))$-long context window.
2. Sparse attention with fixed geometric offsets suffices for efficient universal computation.
3. The simulation improves time and space complexity using multi-queue TMs.
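To make the second finding concrete, here is a minimal sketch of a causal attention mask in which each position looks back only at a fixed set of geometrically spaced offsets. The summary does not state the paper's exact offsets, so the choice of powers of `base`, and the helper names `geometric_offsets` and `sparse_causal_mask`, are assumptions made for illustration only.

```python
def geometric_offsets(window: int, base: int = 2) -> list[int]:
    """Offsets 1, base, base**2, ... that are strictly smaller than `window`.

    Assumption: powers of `base` stand in for the paper's "fixed geometric
    offsets"; the actual offsets used in the construction may differ.
    """
    if base < 2:
        raise ValueError("base must be at least 2")
    offsets = []
    d = 1
    while d < window:
        offsets.append(d)
        d *= base
    return offsets


def sparse_causal_mask(length: int, offsets: list[int]) -> list[list[bool]]:
    """mask[i][j] is True iff position i may attend to position j = i - d."""
    mask = [[False] * length for _ in range(length)]
    for i in range(length):
        for d in offsets:
            if i - d >= 0:
                mask[i][i - d] = True
    return mask


offsets = geometric_offsets(window=16)
mask = sparse_causal_mask(length=16, offsets=offsets)
# Positions that the last token (index 15) is allowed to attend to.
attended = [j for j, allowed in enumerate(mask[15]) if allowed]
```

In this sketch each position attends to at most `len(offsets)` earlier positions, a number that grows logarithmically with the window, instead of to every earlier position as in dense causal attention.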

Abstract

Constant bit-size Transformers are known to be Turing complete, but existing constructions require $\Omega(s(n))$ chain-of-thought (CoT) steps per simulated Turing machine (TM) step, leading to impractical reasoning lengths. In this paper, we significantly reduce this efficiency gap by proving that any $(t(n), s(n))$-bounded multi-tape TM can be simulated by a constant bit-size Transformer with an optimal $O(s(n))$-long context window and only $O(s(n)^{c})$ CoT steps per TM step, where $c > 0$ can be made arbitrarily small by making the Transformer's head-layer product sufficiently large. In addition, our construction shows that sparse attention with fixed geometric offsets suffices for efficient universal computation. Our proof leverages multi-queue TMs as a bridge. The main technical novelty is a more efficient simulation of multi-tape TMs by synchronous multi-queue TMs, improving both…
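As a rough illustration of what the per-step bound means for overall reasoning length, the sketch below compares a cost of about $s$ CoT steps per TM step with a cost of about $s^{c}$ CoT steps per TM step. The function names and the sample values of `t`, `s`, and `c` are hypothetical, and all hidden constants (including any dependence on the head-layer product) are dropped, so the numbers indicate scaling only.

```python
def cot_steps_linear(t: int, s: int) -> int:
    """Total CoT steps if each TM step costs about s CoT steps (constants dropped)."""
    return t * s


def cot_steps_sublinear(t: int, s: int, c: float) -> float:
    """Total CoT steps if each TM step costs about s**c CoT steps (constants dropped)."""
    return t * s ** c


# Hypothetical instance: a TM running for 10**6 steps in space 10**4.
t, s = 10**6, 10**4
baseline = cot_steps_linear(t, s)
improved = cot_steps_sublinear(t, s, c=0.5)
ratio = baseline / improved  # equals s**(1 - c) when constants are ignored
```

With these sample values the ratio is $s^{1-c} = 100$; a smaller $c$ widens the gap, at the price of a larger head-layer product according to the abstract.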

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 3

Strengths

- Contributes to emerging theoretical understanding of the Turing completeness of Transformers.
- Improves over prior work by reducing the CoT overhead in a constant bit-size setting.

Weaknesses

- The design of the model is nonstandard. In particular, adding relative positional encoding vectors in line 195 appears confusing: is the idea that the encoding vector pos(i-j) depends on both the current position j and a later position i from which an attention head looks back at position j? This seems to make both parallel training impossible and autoregressive decoding extremely inefficient, as the whole transformer activations would have to be recomputed throughout the entire context for ev

Reviewer 02 · Rating 8 · Confidence 4

Strengths

1. I find the main result about the space requirements needed to simulate a Turing machine with constant-bit-size CoT transformers to be valuable.
2. The intermediate result converting multitape Turing machines to multiqueue Turing machines is interesting in its own right and technically innovative.
3. The high-level technical plan and proofs are clear and rigorous.

Weaknesses

### Make Dependence of Space/Context Window on k' Explicit

In theorem 2, how does the space O(s) depend on the queue factor k'? The way you are reducing time overhead is increasing k', so it would be nice to understand how space scales with this. It would also be good to understand how this shows up in the main result about transformers: you say that we can make the time overhead arbitrarily small, but how does this increase the context window we need?

### Theorem 3 Suggestions

Overall, the

Reviewer 03 · Rating 4 · Confidence 5

Strengths

This work improves on previously presented results on efficient TM simulations by constant bit-size Transformers. In particular, it is shown how to reduce the number of CoT steps per simulated step from O(s(n)) to O(s(n)^c), for any constant c>0. The analysis is non-trivial and the result is in line with what is actively being researched by many authors studying Transformers through the lens of computational complexity.

Weaknesses

Although the paper improves on known results, the work presents incremental progress in the field. In particular, as shown in Table 1, the submitted paper improves on the results of Li and Wang (2025) essentially only with respect to the number of CoT steps per single TM step. Importantly, the simulation results (presented in Table 1) do not take into account the total CoT length of the simulation. So, by simulating a (t(n), s(n)) time-space bounded multi-tape TM with a single-tape that increases tim

Taxonomy

Topics: Quantum Computing Algorithms and Architecture · Ferroelectric and Negative Capacitance Devices · Parallel Computing and Optimization Techniques