Simply Stabilizing the Loop via Fully Looped Transformer
Rao Fu, Zixuan Yang, Jiankun Zhang, Jing Ma, Hechang Chen, Yu Li, Yi Chang

TL;DR
The paper introduces the Fully Looped Transformer, a modification that stabilizes training of looped transformers, allowing more iterations and improving downstream task performance without increasing parameters.
Contribution
It proposes two parameter-free modifications, Fully Looped Architecture and Attention Injection, that stabilize training and improve the performance of looped transformers.
Findings
Stable training up to 12 loop iterations
Up to 13.2% improvement in downstream tasks
Enhanced adaptability to test-time compute budgets
Abstract
Scaling model performance typically requires increasing model size. Looped Transformer offers a compelling alternative by iteratively reusing the same Transformer blocks, trading additional computation for improved performance without increasing parameter count or context length. Because the number of loop iterations can be adjusted at inference, it also provides a natural mechanism for balancing performance and test-time compute. However, Looped Transformer still suffers from training instability when the number of loop iterations increases. Our analysis reveals that this instability stems from two sources: gradient oscillation and residual explosion. To address these two problems, we propose the Fully Looped Transformer, which introduces two parameter-free modifications: (1) Fully Looped Architecture, which distributes inter-loop signals across all layers to mitigate residual…
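The weight reuse described in the abstract can be illustrated with a minimal sketch. This is a toy illustration of the generic looped idea only (one shared block applied repeatedly), not the paper's Fully Looped Architecture or Attention Injection, whose details are not given here. The names `make_block`, `apply_block`, `looped_forward`, and `count_params` are hypothetical, and the block is a simple residual `tanh` layer standing in for a real Transformer block.

```python
import math
import random


def make_block(dim, seed=0):
    """Create ONE weight matrix (hypothetical toy stand-in for a Transformer block)."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(dim)
    return [[rng.gauss(0.0, scale) for _ in range(dim)] for _ in range(dim)]


def apply_block(weights, x):
    """Residual update x + tanh(W x); assumed form, chosen only for illustration."""
    return [
        xi + math.tanh(sum(w * xj for w, xj in zip(row, x)))
        for row, xi in zip(weights, x)
    ]


def looped_forward(weights, x, n_loops):
    """Apply the SAME block n_loops times; n_loops can be chosen at inference."""
    for _ in range(n_loops):
        x = apply_block(weights, x)
    return x


def count_params(weights):
    """Parameter count depends only on the block, not on the number of loops."""
    return sum(len(row) for row in weights)


if __name__ == "__main__":
    w = make_block(dim=4, seed=1)
    x = [0.5, -0.25, 1.0, 0.0]
    for n in (1, 4, 12):
        y = looped_forward(w, x, n)
        norm = math.sqrt(sum(v * v for v in y))
        print(f"loops={n:2d} params={count_params(w)} output_norm={norm:.3f}")
```

In this sketch the loop count is just an argument to `looped_forward`, which is what lets compute be traded for quality at test time without touching the parameter count. Because every iteration adds to the residual stream, the output norm can grow with the number of loops, which loosely relates to the residual growth the abstract mentions, though the toy makes no claim about the paper's actual analysis.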
