Universal Transformers Need Memory: Depth-State Trade-offs in Adaptive Recursive Reasoning

Grigory Sapunov

arXiv:2604.21999 · cs.LG · May 5, 2026

TL;DR

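This paper studies learned memory tokens in a Universal Transformer with Adaptive Computation Time (ACT) on Sudoku-Extreme. It shows that memory tokens are necessary, identifies the range of token counts that works best, and characterizes how memory size trades off against ponder depth. It also covers training challenges, including a router initialization trap, and their solutions.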

Contribution

It demonstrates the essential role of memory tokens, identifies a router initialization trap, and shows how ACT improves training stability and resource efficiency.

Findings

01. Memory tokens are necessary: no configuration without them reaches non-trivial performance.

02. The optimal memory token count lies on a plateau of T=8-32, with a sharp lower threshold (T=0 always fails, T=8 reliably succeeds).

03. ACT reduces seed variance and enables specialization of attention heads.

Abstract

We study learned memory tokens as a computational scratchpad for a single-block Universal Transformer with Adaptive Computation Time (ACT) on Sudoku-Extreme, a combinatorial reasoning benchmark. Memory tokens are empirically necessary: no configuration without them reaches non-trivial performance. The optimal count has a sharp lower threshold (T=0 always fails, T=8 reliably succeeds) followed by a stable plateau (T=8-32, 57.4% +/- 0.7% exact-match) and a dilution boundary at T=64. Under halt-side pressure (lambda warmup), mean halt drops monotonically with memory size across the plateau (from 11.6 at T=8 to 8.3 at T=64), showing that memory tokens and ponder depth substitute as resources at fixed accuracy. We also identify a router initialization trap that causes the majority of training runs to fail: both default zero-bias and Graves' recommended positive bias settle into a shallow…
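The halting mechanism the abstract refers to can be illustrated with a small sketch. This is a minimal pure-Python version of the halting rule from Graves' original ACT formulation, not the paper's JAX implementation. The names `act_halt` and `ponder_cost`, the threshold `eps`, the `max_steps` cap, and the bias values in the demo are all assumptions chosen for illustration. The demo feeds a constant halting logit, standing in for an untrained router whose output is dominated by its bias, to show how the initial bias alone determines how many recursion steps are taken.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def act_halt(halt_logits, eps=0.01, max_steps=16):
    """Graves-style ACT halting for a single position (illustrative sketch).

    halt_logits: per-step router outputs before the sigmoid.
    Returns (n_steps, weights); the weights sum to 1 and would be used to
    mix the outputs of the shared block across recursion steps.
    """
    limit = min(max_steps, len(halt_logits))
    weights = []
    cumulative = 0.0
    for n, logit in enumerate(halt_logits[:limit], start=1):
        p = sigmoid(logit)
        # Halt once the accumulated probability reaches 1 - eps,
        # or when the step budget is exhausted.
        if cumulative + p >= 1.0 - eps or n == limit:
            weights.append(1.0 - cumulative)  # remainder R
            return n, weights
        weights.append(p)
        cumulative += p
    raise ValueError("halt_logits must be non-empty and max_steps >= 1")


def ponder_cost(n_steps, weights):
    """Ponder cost N + R, the quantity an ACT penalty term acts on."""
    return n_steps + weights[-1]


# Constant logits model a router dominated by its bias (hypothetical values).
for bias in (0.0, 1.0, -3.0):
    n, w = act_halt([bias] * 32, max_steps=32)
    print(f"bias={bias:+.1f} -> halts after {n} steps, "
          f"ponder cost {ponder_cost(n, w):.3f}")
# bias=+0.0 and bias=+1.0 both halt after 2 steps; bias=-3.0 halts after 21.
```

Under this rule, a zero or positive bias gives a per-step halting probability of at least 0.5, so an untrained router halts after two steps regardless of the input. The negative bias here is only a contrast case for the sketch and is not taken from the paper.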

Code & Models

Repositories

che-shr-cat/utm-jax (GitHub)
