Sparse Layers are Critical to Scaling Looped Language Models
Ryan Lee, Jacob Biloki, Edward J. Hu, Jonathan May

TL;DR
Looped language models scale better than a standard transformer baseline when they use Mixture-of-Experts layers; dense looped models do not. Looped models also offer better compute-quality trade-offs with early exits, enabling memory and inference savings.
Contribution
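This paper shows that Looped-MoE models scale better than a standard transformer baseline, whereas dense looped models do not, and traces the gain to expert routing divergence between loops. It also shows that early exits at loop boundaries give looped models better compute-quality trade-offs than standard models.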
Findings
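Looped-MoE models scale better than the standard baseline; dense looped models do not.
Looped models with early exits have better compute-quality trade-offs than standard models.
Routing divergence between loops (different experts activated on each pass through the same shared layers) recovers expressivity in Looped-MoE models without extra parameters.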
Abstract
Looped language models repeat a set of transformer layers through depth, reducing memory costs and providing natural early-exit points at loop boundaries. However, looped models do not scale as favorably as standard transformers with unique layers. We compare standard and Mixture-of-Experts (MoE) transformers, with and without looping, and find two main results. First, we find Looped-MoE models scale better than the standard baseline while dense looped models do not. We trace this to routing divergence between loops: in Looped-MoE models, different experts are activated on each pass through the same shared layers, recovering expressivity without additional parameters. Our second finding is that looped models have better compute-quality trade-offs with early exits than standard models. Because each loop ends with the same layers that produce the final output, loop boundaries are superior…
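To make the mechanism in the abstract concrete, here is a minimal toy sketch in plain Python of one shared MoE layer applied several times, with an optional early exit at a loop boundary. It is an illustration under stated assumptions, not the authors' implementation: the class name `LoopedMoE`, top-1 routing, the tanh expert, the 2-way output head, and all sizes are hypothetical choices made for this example.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def rand_matrix(rng, rows, cols, scale=0.5):
    """Random Gaussian matrix; the scale is an arbitrary choice for the toy."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class LoopedMoE:
    """Toy looped model: ONE shared MoE layer applied n_loops times (hypothetical)."""

    def __init__(self, dim=8, n_experts=4, n_loops=3, seed=0):
        rng = random.Random(seed)
        self.n_loops = n_loops
        self.router = rand_matrix(rng, n_experts, dim)   # shared across loops
        self.experts = [rand_matrix(rng, dim, dim) for _ in range(n_experts)]
        self.head = rand_matrix(rng, 2, dim)             # shared output head

    def forward(self, x, exit_after=None):
        """Run the shared layer for `exit_after` loops (default: all loops).

        Returns (logits, experts), where experts[i] is the expert index
        selected on loop i for this input.
        """
        loops = self.n_loops if exit_after is None else exit_after
        if not 1 <= loops <= self.n_loops:
            raise ValueError("exit_after must be between 1 and n_loops")
        h = list(x)
        chosen = []
        for _ in range(loops):
            scores = matvec(self.router, h)
            k = max(range(len(scores)), key=scores.__getitem__)  # top-1 routing
            chosen.append(k)
            update = [math.tanh(v) for v in matvec(self.experts[k], h)]
            h = [a + b for a, b in zip(h, update)]               # residual update
        # The same head reads the hidden state at any loop boundary.
        return matvec(self.head, h), chosen


model = LoopedMoE()
rng = random.Random(1)
x = [rng.gauss(0.0, 1.0) for _ in range(8)]

full_logits, full_route = model.forward(x)                 # all 3 loops
early_logits, early_route = model.forward(x, exit_after=1)  # exit at first boundary
```

Because the router sees a different hidden state on each pass, `full_route` can contain different expert indices across loops even though the parameters are shared; whether it does for a given input depends on the weights. The early-exit call reuses the same parameters and simply stops at an earlier loop boundary, so `early_route` is a prefix of `full_route`.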
