TL;DR
This paper introduces Hyperloop Transformers, a parameter-efficient architecture that combines looped Transformers with hyper-connections and outperforms depth-matched Transformer baselines in memory-constrained language modeling.
Contribution
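The paper proposes the Hyperloop Transformer, an architecture organized into begin, middle, and end blocks in which only the middle block is reused (looped) across depth and is augmented with hyper-connections, reducing parameter count while maintaining or improving performance.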
Findings
Hyperloop Transformers outperform depth-matched Transformer baselines while using 50% fewer parameters.
The performance advantage persists after weight quantization, indicating that the gains are robust to quantization.
Hyperloop Transformers are well suited to memory-constrained language modeling settings such as edge and on-device deployment.
Abstract
LLM architecture research generally aims to maximize model quality subject to fixed compute/latency budgets. However, many applications of interest such as edge and on-device deployment are further constrained by the model's memory footprint, thus motivating parameter-efficient architectures for language modeling. This paper describes a simple architecture that improves the parameter-efficiency of LLMs. Our architecture makes use of looped Transformers as a core primitive, which reuse Transformer layers across depth and are thus more parameter-efficient than ordinary (depth-matched) Transformers. We organize the looped Transformer into three blocks--begin, middle, and end blocks--where each block itself consists of multiple Transformer layers, and only the middle block is applied recurrently across depth. We augment the looped middle block with hyper-connections (Xie et al., 2026),…
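The begin/middle/end organization described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `apply_block`, `looped_forward`, `effective_depth`, and `unique_layers` are hypothetical, the layers are toy callables standing in for real Transformer layers, and hyper-connections are omitted because their description is truncated in the abstract above. The sketch only shows how reusing the middle block increases effective depth without adding stored layers.

```python
from typing import Callable, Sequence

# A "layer" is any callable mapping a hidden state to a new hidden state.
# Toy stand-in for a real Transformer layer (attention + MLP); an assumption.
Layer = Callable[[float], float]


def apply_block(x: float, block: Sequence[Layer]) -> float:
    """Apply each layer of a block once, in order."""
    for layer in block:
        x = layer(x)
    return x


def looped_forward(
    x: float,
    begin: Sequence[Layer],
    middle: Sequence[Layer],
    end: Sequence[Layer],
    n_loops: int,
) -> float:
    """Run begin once, the shared middle block n_loops times, then end once."""
    x = apply_block(x, begin)
    for _ in range(n_loops):
        # The same middle layers (same weights) are reused on every pass.
        x = apply_block(x, middle)
    return apply_block(x, end)


def effective_depth(n_begin: int, n_middle: int, n_end: int, n_loops: int) -> int:
    """Number of layer applications in one forward pass."""
    return n_begin + n_middle * n_loops + n_end


def unique_layers(n_begin: int, n_middle: int, n_end: int) -> int:
    """Number of layers whose weights must actually be stored."""
    return n_begin + n_middle + n_end


# Hypothetical configuration: 2 layers per block, middle block looped 3 times.
inc = lambda h: h + 1  # toy layer: counts how many times it was applied
depth_reached = looped_forward(0, [inc, inc], [inc, inc], [inc, inc], n_loops=3)
print(depth_reached)                 # 10 layer applications
print(effective_depth(2, 2, 2, 3))   # 10
print(unique_layers(2, 2, 2))        # 6
```

In this hypothetical configuration, and assuming every layer has the same number of parameters, the looped model stores the weights of 6 layers while matching the depth of a 10-layer ordinary Transformer. These numbers are illustrative only and do not reflect the configurations evaluated in the paper.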
