TL;DR
This paper studies learned memory tokens in a single-block Universal Transformer with Adaptive Computation Time (ACT) on Sudoku-Extreme. It shows that the tokens are necessary, that a count of 8-32 works best, and that memory size and ponder depth trade off as resources. It also describes a router initialization trap that makes most training runs fail, and discusses remedies.
Contribution
It demonstrates the essential role of memory tokens, identifies a router initialization trap, and shows how ACT improves training stability and resource efficiency.
Findings
Memory tokens are necessary for non-trivial performance.
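The optimal memory token count sits on a plateau at T=8-32 (57.4% +/- 0.7% exact-match), with a sharp lower threshold (T=0 always fails, T=8 reliably succeeds) and a dilution boundary at T=64.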
ACT reduces seed variance and enables specialization of attention heads.
Abstract
We study learned memory tokens as a computational scratchpad for a single-block Universal Transformer with Adaptive Computation Time (ACT) on Sudoku-Extreme, a combinatorial reasoning benchmark. Memory tokens are empirically necessary: no configuration without them reaches non-trivial performance. The optimal count has a sharp lower threshold (T=0 always fails, T=8 reliably succeeds) followed by a stable plateau (T=8-32, 57.4% +/- 0.7% exact-match) and a dilution boundary at T=64. Under halt-side pressure (lambda warmup), mean halt drops monotonically with memory size across the plateau (from 11.6 at T=8 to 8.3 at T=64), showing that memory tokens and ponder depth substitute as resources at fixed accuracy. We also identify a router initialization trap that causes the majority of training runs to fail: both default zero-bias and Graves' recommended positive bias settle into a shallow…
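As a rough illustration of why the halting router's initial bias matters, the sketch below applies the standard ACT halting rule (stop once the cumulative halting probability reaches 1 - eps) to a router whose weights contribute nothing at initialization, so every step emits the same probability sigmoid(bias). The function name `initial_halt_step`, the eps value, the step cap, and the negative-bias case are assumptions made for illustration; none of them come from the paper, and this does not reproduce the paper's analysis of the trap.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def initial_halt_step(bias: float, eps: float = 0.01, max_steps: int = 32) -> int:
    """Ponder steps taken when every step emits halting probability sigmoid(bias).

    Hypothetical helper: models a freshly initialized router whose output
    depends only on its bias. eps and max_steps are assumed values.
    """
    p = sigmoid(bias)
    cumulative = 0.0
    for step in range(1, max_steps + 1):
        cumulative += p
        # ACT halting rule: stop once the cumulative probability reaches 1 - eps.
        if cumulative >= 1.0 - eps:
            return step
    return max_steps


for bias in (0.0, 1.0, -3.0):
    print(f"bias={bias:+.1f} -> halts at step {initial_halt_step(bias)}")
# bias=+0.0 -> halts at step 2
# bias=+1.0 -> halts at step 2
# bias=-3.0 -> halts at step 21
```

Under these assumptions, a zero bias and a positive bias both start training at a ponder depth of two steps, whereas a negative bias starts much deeper.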
