Sparse Layers are Critical to Scaling Looped Language Models

Ryan Lee; Jacob Biloki; Edward J. Hu; Jonathan May

arXiv:2605.09165 · cs.LG · May 12, 2026

TL;DR

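Looped language models with Mixture-of-Experts (MoE) layers scale better than a standard transformer baseline, whereas dense looped models do not. Looping also reduces memory costs and, with early exits at loop boundaries, gives better compute-quality trade-offs than standard models.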

Contribution

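This paper compares standard and MoE transformers, with and without looping. It shows that Looped-MoE models scale better than the standard baseline while dense looped models do not, attributes the difference to routing divergence between loops, and finds that looped models give better compute-quality trade-offs with early exits than standard models.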

Findings

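01. Looped-MoE models scale better than the standard baseline; dense looped models do not.

02. Looped models with early exits have better compute-quality trade-offs than standard models.

03. In Looped-MoE models, different experts are activated on each pass through the shared layers, recovering expressivity without extra parameters.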
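The memory argument behind looping can be illustrated with a rough weight count. The sketch below is not taken from the paper: the function names, the layer sizes, and the simplified per-layer formula (dense layers, biases and norms ignored) are all assumptions chosen for illustration. It only shows that reusing a block of layers increases effective depth without increasing the number of stored weights.

```python
def layer_params(d_model: int, d_ff: int) -> int:
    """Rough weight count for one dense transformer layer.

    Assumption: biases, norms, and embeddings are ignored.
    """
    attention = 4 * d_model * d_model   # Q, K, V, and output projections
    feed_forward = 2 * d_model * d_ff   # up and down projections
    return attention + feed_forward


def standard_params(depth: int, d_model: int, d_ff: int) -> int:
    """Standard transformer: every layer has its own weights."""
    return depth * layer_params(d_model, d_ff)


def looped_params(unique_layers: int, d_model: int, d_ff: int) -> int:
    """Looped transformer: only the shared block is stored.

    The number of loops does not appear here, because repeated
    passes reuse the same weights.
    """
    return unique_layers * layer_params(d_model, d_ff)


def effective_depth(unique_layers: int, n_loops: int) -> int:
    """Number of layer applications in one forward pass."""
    return unique_layers * n_loops


if __name__ == "__main__":
    # Hypothetical sizes, chosen only to make the ratio easy to read.
    d_model, d_ff = 512, 2048
    unique_layers, n_loops = 4, 3
    depth = effective_depth(unique_layers, n_loops)
    print("effective depth:", depth)
    print("standard weights:", standard_params(depth, d_model, d_ff))
    print("looped weights:  ", looped_params(unique_layers, d_model, d_ff))
```

Under these assumed sizes, the looped model applies as many layers per forward pass as the standard model while storing only the shared block.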

Abstract

Looped language models repeat a set of transformer layers through depth, reducing memory costs and providing natural early-exit points at loop boundaries. However, looped models do not scale as favorably as standard transformers with unique layers. We compare standard and Mixture-of-Experts (MoE) transformers, with and without looping, and find two main results. First, we find Looped-MoE models scale better than the standard baseline while dense looped models do not. We trace this to routing divergence between loops: in Looped-MoE models, different experts are activated on each pass through the same shared layers, recovering expressivity without additional parameters. Our second finding is that looped models have better compute-quality trade-offs with early exits than standard models. Because each loop ends with the same layers that produce the final output, loop boundaries are superior…
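The two mechanisms described in the abstract, routing divergence between loops and early exits at loop boundaries, can be sketched with a toy model. This is a minimal illustration under stated assumptions, not the authors' implementation: the names MoELayer and looped_forward, the top-1 routing rule, the tanh experts, and the random weights are all hypothetical, and attention is omitted entirely. The point is only that the hidden state changes between passes, so the same router may select a different expert on each loop, and that stopping at a loop boundary returns a state that has passed through the whole shared block.

```python
import math
import random


def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def make_matrix(rng, rows, cols, scale):
    """Random Gaussian matrix; stands in for trained weights."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class MoELayer:
    """Toy MoE layer with top-1 routing (an assumed routing rule)."""

    def __init__(self, rng, d_model, n_experts):
        self.router = make_matrix(rng, n_experts, d_model, 1.0)
        self.experts = [
            make_matrix(rng, d_model, d_model, d_model ** -0.5)
            for _ in range(n_experts)
        ]

    def __call__(self, x):
        scores = matvec(self.router, x)
        # Pick the single highest-scoring expert for this input.
        expert = max(range(len(scores)), key=scores.__getitem__)
        update = [math.tanh(v) for v in matvec(self.experts[expert], x)]
        # Residual connection: the state drifts from loop to loop.
        return [a + b for a, b in zip(x, update)], expert


def looped_forward(layers, x, n_loops, exit_after=None):
    """Run the shared layers n_loops times, optionally exiting early.

    Returns the hidden state and, per loop, the expert chosen in each layer.
    Early exit is only allowed at a loop boundary.
    """
    trace = []
    for loop in range(n_loops):
        picks = []
        for layer in layers:
            x, expert = layer(x)
            picks.append(expert)
        trace.append(picks)
        if exit_after is not None and loop + 1 == exit_after:
            break
    return x, trace


if __name__ == "__main__":
    rng = random.Random(0)
    layers = [MoELayer(rng, d_model=8, n_experts=4) for _ in range(2)]
    x0 = [rng.gauss(0.0, 1.0) for _ in range(8)]
    _, trace = looped_forward(layers, x0, n_loops=3)
    for i, picks in enumerate(trace):
        print(f"loop {i}: experts chosen per layer = {picks}")
```

Comparing the per-loop rows of the printed trace shows whether routing diverged for a given input. With random weights this is not guaranteed for every input, and the sketch makes no claim about how often it happens in a trained model.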
