Simply Stabilizing the Loop via Fully Looped Transformer

Rao Fu; Zixuan Yang; Jiankun Zhang; Jing Ma; Hechang Chen; Yu Li; Yi Chang

arXiv:2605.18797·cs.LG·May 20, 2026

TL;DR

The paper introduces the Fully Looped Transformer, a variant of the Looped Transformer that stabilizes training, allowing more loop iterations and improving downstream task performance without increasing parameter count.

Contribution

It proposes two parameter-free modifications, Fully Looped Architecture and Attention Injection, that stabilize training and improve the performance of looped transformers.

Findings

01. Stable training up to 12 loop iterations

02. Up to 13.2% improvement in downstream tasks

03. Enhanced adaptability to test-time compute budgets

Abstract

Scaling model performance typically requires increasing model size. The Looped Transformer offers a compelling alternative by iteratively reusing the same Transformer blocks, trading additional computation for improved performance without increasing parameter count or context length. Because the number of loop iterations can be adjusted at inference, it also provides a natural mechanism for balancing performance and test-time compute. However, the Looped Transformer still suffers from training instability when the number of loop iterations increases. Our analysis reveals that this instability stems from two sources: gradient oscillation and residual explosion. To address these two problems, we propose the Fully Looped Transformer, which introduces two parameter-free modifications: (1) Fully Looped Architecture, which distributes inter-loop signals across all layers to mitigate residual…
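The basic looping idea described in the abstract (one set of block weights reused for a configurable number of iterations) can be sketched in a few lines. This is only a toy illustration of a plain looped forward pass, not the authors' Fully Looped Architecture or Attention Injection, whose details are not given in this summary. The block here is a hypothetical stand-in (a residual `tanh` layer instead of attention and an MLP), and all names (`make_block`, `block_forward`, `looped_forward`) are assumptions made for the example.

```python
import math
import random


def make_block(dim, seed=0):
    """Create one weight matrix; this is the only parameter storage."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(dim)
    return [[rng.gauss(0.0, scale) for _ in range(dim)] for _ in range(dim)]


def block_forward(weights, x):
    """Toy stand-in for a Transformer block: x + tanh(W x)."""
    update = [math.tanh(sum(w * v for w, v in zip(row, x))) for row in weights]
    return [xi + ui for xi, ui in zip(x, update)]


def looped_forward(weights, x, n_loops):
    """Apply the same block n_loops times; weights are shared across loops."""
    h = list(x)
    for _ in range(n_loops):
        h = block_forward(weights, h)
    return h


def param_count(weights):
    """Number of scalar parameters in the shared block."""
    return sum(len(row) for row in weights)


def l2_norm(v):
    return math.sqrt(sum(a * a for a in v))


weights = make_block(dim=8)
x = [1.0] * 8

# The loop count is a runtime argument: more loops cost more compute,
# while the parameter count stays fixed.
for n_loops in (1, 4, 12):
    h = looped_forward(weights, x, n_loops)
    print(n_loops, param_count(weights), round(l2_norm(h), 3))
```

Printing the hidden-state norm alongside the loop count is a simple way to watch how repeated residual updates change the magnitude of the state in this toy setting. It makes no claim about the behavior the paper reports for real Transformer blocks.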
