LT2: Linear-Time Looped Transformers
Chunyuan Deng, Yizhe Zhang, Rui-Jie Zhu, Yuanyuan Xu, Jiarui Liu, T. S. Eugene Ng, Hanjie Chen

TL;DR
LT2 replaces the quadratic softmax attention in looped transformers with linear-time attention (linear or sparse), targeting scalable, efficient language modeling with consistent empirical gains on recall, state-tracking, and language modeling tasks.
Contribution
The paper develops LT2, a family of looped transformer architectures with linear-time attention, and demonstrates their effectiveness and scalability across controlled recall, state-tracking, and language modeling tasks.
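To make the idea of looping a linear-time attention block concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the paper's implementation: the feature map `phi` (elu + 1), the residual update, and the names `causal_linear_attention` and `looped_block` are all hypothetical choices. The same weights are reused on every loop iteration, and each pass costs time linear in sequence length.

```python
import numpy as np

def phi(x):
    # Positive feature map elu(x) + 1 (an assumed choice, not taken from the paper).
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def causal_linear_attention(q, k, v):
    # q, k: (n, d); v: (n, d_v). A single pass over the sequence: O(n * d * d_v).
    n, d = q.shape
    state = np.zeros((d, v.shape[1]))  # running sum of phi(k_t) v_t^T (the "memory")
    norm = np.zeros(d)                 # running sum of phi(k_t)
    out = np.zeros_like(v)
    fq, fk = phi(q), phi(k)
    for t in range(n):
        state += np.outer(fk[t], v[t])
        norm += fk[t]
        out[t] = fq[t] @ state / (fq[t] @ norm)
    return out

def looped_block(x, wq, wk, wv, num_loops):
    # Apply the same attention block (shared weights) num_loops times
    # with a residual connection. Hypothetical structure for illustration.
    for _ in range(num_loops):
        x = x + causal_linear_attention(x @ wq, x @ wk, x @ wv)
    return x

rng = np.random.default_rng(0)
n, d = 6, 4
x = rng.normal(size=(n, d))
wq, wk, wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
y = looped_block(x, wq, wk, wv, num_loops=3)
```

Because the memory `state` is rebuilt from the updated hidden states on each iteration, every loop gets another chance to rewrite what is stored, which is one informal way to read "iterative memory refinement".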
Findings
LT2 variants achieve consistent empirical gains in language tasks.
Hybrid attention models match or surpass standard looped transformers in quality.
Converted models outperform industry-level 1B models and are competitive with 4B models.
Abstract
Looped Transformers (LT) have emerged as a powerful architecture by iterating their layers multiple times before decoding the final token. However, pairing them with full attention retains quadratic complexity, making them computationally expensive and slow. We introduce LT2 (Linear-Time Looped Transformers), a family of looped architectures that replace quadratic softmax attention with subquadratic, linear-time attention. We study two variants: LT2-linear with linear attention and LT2-sparse with sparse attention. We find that looping uniquely synergizes with these variants: it enables iterative memory refinement in linear attention and progressively expands the effective receptive field in sparse attention. We formalize these benefits theoretically and demonstrate consistent empirical gains across controlled recall, state-tracking, and language modeling tasks. We then explore…
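The claim that looping expands the effective receptive field of sparse attention can be illustrated with a small reachability computation. The sketch below assumes a causal sliding-window pattern as the sparse attention; the abstract does not say which sparsity pattern LT2-sparse uses, so the window mask and the helper names are hypothetical. With a window of `w` tokens, each extra loop iteration lets information travel another `w - 1` positions back.

```python
import numpy as np

def sliding_window_mask(n, window):
    # mask[i, j] is True when token i may attend to token j:
    # causal (j <= i) and within the last `window` positions.
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (i - j < window)

def reachable_after_loops(mask, num_loops):
    # reach[i, j] is True when information from token j can reach token i
    # after applying the same masked attention num_loops times.
    reach = np.eye(mask.shape[0], dtype=bool)
    for _ in range(num_loops):
        reach = (mask.astype(int) @ reach.astype(int)) > 0
    return reach

def receptive_field(mask, num_loops):
    # Number of positions the last token can draw on after num_loops iterations.
    return int(reachable_after_loops(mask, num_loops)[-1].sum())

mask = sliding_window_mask(64, window=4)
fields = [receptive_field(mask, t) for t in range(1, 6)]
```

Under these assumptions the receptive field of the last token grows as `min(n, t * (window - 1) + 1)` after `t` loops, so a fixed per-iteration cost still reaches progressively further into the context.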
