LoopFormer: Elastic-Depth Looped Transformers for Latent Reasoning via Shortcut Modulation
Ahmadreza Jeddi, Marco Ciccone, Babak Taati

TL;DR
LoopFormer introduces a flexible, budget-conditioned looped Transformer architecture that adapts its reasoning depth dynamically, improving efficiency and performance in language modeling and reasoning tasks under variable compute constraints.
Contribution
It proposes a shortcut-consistency training scheme for looped Transformers, enabling adaptive reasoning depth across variable-length trajectories.
Findings
Robust performance on language modeling benchmarks under compute constraints
Graceful scaling with additional computational budget
Effective latent reasoning with adaptive loop iterations
Abstract
Looped Transformers have emerged as an efficient and powerful class of models for reasoning in the language domain. Recent studies show that these models achieve strong performance on algorithmic and reasoning tasks, suggesting that looped architectures possess an inductive bias toward latent reasoning. However, prior approaches fix the number of loop iterations during training and inference, leaving open the question of whether these models can flexibly adapt their computational depth under variable compute budgets. We introduce LoopFormer, a looped Transformer trained on variable-length trajectories to enable budget-conditioned reasoning. Our core contribution is a shortcut-consistency training scheme that aligns trajectories of different lengths, ensuring that shorter loops yield informative representations while longer loops continue to refine them. LoopFormer conditions each loop…
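The alignment of short and long trajectories described in the abstract can be illustrated with a toy sketch. This is not the paper's training scheme: the two block functions, the fixed `TARGET`, the normalized time axis, and the squared-error gap are all hypothetical stand-ins. It only shows what it means for a shared block to give the same final state under different loop budgets.

```python
import math

TARGET = [1.0, -1.0, 0.5]  # arbitrary fixed point for the toy dynamics (assumption)


def euler_block(h, t, dt):
    """Toy shared block: naive update whose end result depends on the step size.

    The toy ignores t; it is kept only to mirror a time-conditioned signature.
    """
    return [x + dt * (g - x) for x, g in zip(h, TARGET)]


def exact_block(h, t, dt):
    """Toy shared block: step-size-aware update that composes exactly.

    One step of size 2*dt gives the same state as two steps of size dt.
    """
    gain = 1.0 - math.exp(-dt)
    return [x + gain * (g - x) for x, g in zip(h, TARGET)]


def run_loops(block, h0, num_loops):
    """Apply the same block num_loops times over normalized time [0, 1]."""
    dt = 1.0 / num_loops
    h = list(h0)
    for k in range(num_loops):
        h = block(h, k * dt, dt)
    return h


def consistency_gap(block, h0, short_loops, full_loops):
    """Mean squared difference between short- and full-trajectory outputs."""
    h_short = run_loops(block, h0, short_loops)
    h_full = run_loops(block, h0, full_loops)
    return sum((a - b) ** 2 for a, b in zip(h_short, h_full)) / len(h0)


h0 = [0.0, 0.0, 0.0]
print(consistency_gap(euler_block, h0, 2, 8))  # clearly nonzero
print(consistency_gap(exact_block, h0, 2, 8))  # zero up to rounding
```

In this toy setting, a consistency-style objective would favor behavior like `exact_block`, where a small budget already lands near the state that a larger budget reaches.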
Peer Reviews
Decision: ICLR 2026 Poster
The proposed architecture is reasonably simple and well motivated. The overall direction, Transformer-like models that can dynamically modulate compute, is important. The results showing the nice monotonicity of perplexity or accuracy vs. FLOPs (Fig. 2) are great.
The experimental results are underwhelming, and maybe I missed the point, but it is unclear to me how this model improves over a vanilla Transformer. The results are presented in such a way that the training budget is not equalized (if I am correct; I do not understand the first sentence of Section 4.1), the inference FLOPs are, and the parameter counts are not, although this is generally not the limiting factor. Training budgets should have been equalized (e.g. pick the number of training iterations per mo
1. The paper tackles a clear, practical, and important problem. Enabling flexible, elastic compute in parameter-efficient models like looped Transformers is a highly valuable research direction.
2. The paper is well written and easy to follow.
3. The experiments are thorough and convincing. In addition to strong task performance, the authors provide a compelling explanation for why LoopFormer works by analyzing metrics like curvature, anisotropy, and CKA similarity. They demonstrate that baselin
1. The training procedure (Algorithm 1) requires two forward passes per batch (for the full and short trajectories) to compute the consistency loss. This appears to roughly double the training cost compared to a standard looped model. The paper mentions this as a limitation but does not quantify it. A brief analysis of the training FLOPs/time overhead vs. a Base-Loop or TMLT baseline would be valuable for assessing the practical trade-offs.
2. The paper heavily emphasizes "latent reasoning," usi
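The "roughly double" estimate in the first point above can be made concrete with back-of-the-envelope arithmetic. The sketch below assumes that forward cost is proportional to the number of loop iterations and that the full and short passes share no computation; both are assumptions, and the function name and loop counts are hypothetical rather than taken from the paper.

```python
def relative_training_cost(full_loops, short_loops):
    """Forward cost of (full + short) trajectories relative to the full one alone.

    Assumes cost scales linearly with loop count and nothing is shared
    between the two passes (illustrative assumptions only).
    """
    if full_loops <= 0 or not 0 < short_loops <= full_loops:
        raise ValueError("need 0 < short_loops <= full_loops")
    return (full_loops + short_loops) / full_loops


print(relative_training_cost(8, 8))  # 2.0: worst case, short pass as long as full
print(relative_training_cost(8, 2))  # 1.25: a short pass of 2 loops adds 25%
```

Under these assumptions the overhead reaches a factor of two only when the short trajectory is as long as the full one, which is why a measured FLOPs or wall-clock comparison would settle the question.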
* The paper is well written and easy to follow, with clear motivation and setups
* The motivation is clearly presented, and the transition from fixed-depth looped Transformers to elastic-depth design feels natural.
* Experiments are reasonably comprehensive, evaluating both variable loop lengths and the effect of the proposed shortcut-consistency loss.
* The main claims are well supported
* The degree of novelty is moderate. While the proposed elastic-depth formulation and shortcut-consistency loss are well designed, they extend existing time-modulated looped Transformer frameworks rather than introducing a fundamentally new paradigm.
* The paper does not provide theoretical intuition or analysis explaining why combining $t$ and $\Delta t$ through sinusoidal modulation is a good choice here
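For readers unfamiliar with the sinusoidal modulation the last point refers to, here is a minimal sketch of standard sinusoidal features for two scalars, $t$ and $\Delta t$. How the paper actually combines them is not stated in this excerpt; the concatenation in `time_step_conditioning`, the function names, and the `max_period` default are assumptions for illustration.

```python
import math


def sinusoidal_embedding(value, dim, max_period=10000.0):
    """Standard sinusoidal features of a scalar: sines first, then cosines."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    half = dim // 2
    # Geometrically spaced frequencies from 1 down toward 1 / max_period.
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    sines = [math.sin(value * f) for f in freqs]
    cosines = [math.cos(value * f) for f in freqs]
    return sines + cosines


def time_step_conditioning(t, dt, dim):
    """Hypothetical combination: concatenate the embeddings of t and dt."""
    return sinusoidal_embedding(t, dim) + sinusoidal_embedding(dt, dim)


# Same position t, different step sizes: the conditioning vectors differ.
coarse = time_step_conditioning(0.5, 0.5, 8)
fine = time_step_conditioning(0.5, 0.125, 8)
print(len(coarse), coarse != fine)
```

A vector like this would typically be mapped to scale and shift parameters for each loop, but that mapping is left out here because the excerpt does not describe it.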
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques
