Memory-Efficient Looped Transformer: Decoupling Compute from Memory in Looped Language Models
Victor Conchello Vendrell, Arnau Padres Masdemont, Niccolò Grillo, Jordi Ros-Giralt, Arash Behboodi, Fabio Valerio Massoli

TL;DR
MELT introduces a memory-efficient transformer architecture that decouples reasoning depth from memory use, enabling scalable iterative reasoning in language models without increasing memory footprint.
Contribution
The paper proposes MELT, a novel architecture with shared KV caches and a training method that enables constant-memory iterative reasoning in language models.
Findings
MELT outperforms standard LLMs of similar size after fine-tuning.
MELT's memory footprint is comparable to that of standard (non-looped) models and much smaller than Ouro's.
MELT achieves scalable reasoning without sacrificing performance.
Abstract
Recurrent LLM architectures have emerged as a promising approach for improving reasoning, as they enable multi-step computation in the embedding space without generating intermediate tokens. Models such as Ouro perform reasoning by iteratively updating internal representations while retaining a standard Key-Value (KV) cache across iterations, causing memory consumption to grow linearly with reasoning depth. Consequently, increasing the number of reasoning iterations can lead to prohibitive memory usage, limiting the practical scalability of such architectures. In this work, we propose Memory-Efficient Looped Transformer (MELT), a novel architecture that decouples reasoning depth from memory consumption. Instead of using a standard KV cache per layer and loop, MELT maintains a single KV cache per layer that is shared across reasoning loops. This cache is updated over time via a learnable…
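The memory argument in the abstract can be illustrated with back-of-the-envelope arithmetic. The sketch below is not taken from the paper: the function name `kv_cache_bytes` and every configuration value (layer count, sequence length, head count, head dimension, 2-byte elements) are hypothetical. It also assumes that a cache shared across loops has the same size as a single-loop cache, and it ignores any extra state the cache update might need.

```python
def kv_cache_bytes(num_layers, seq_len, num_kv_heads, head_dim,
                   num_loops=1, bytes_per_elem=2):
    """Bytes needed to store keys and values.

    num_loops > 1 models one separate cache per layer AND per loop
    (the per-loop caching the abstract attributes to models such as Ouro).
    num_loops = 1 models one cache per layer, which is also the size
    assumed here for a cache shared across all loops.
    """
    # Factor 2 accounts for keys plus values.
    per_layer = 2 * seq_len * num_kv_heads * head_dim * bytes_per_elem
    return num_layers * num_loops * per_layer


# Hypothetical configuration, chosen only for illustration.
cfg = dict(num_layers=24, seq_len=4096, num_kv_heads=16, head_dim=128)

# One cache per layer, shared across loops: independent of loop count.
shared_cache = kv_cache_bytes(num_loops=1, **cfg)

# One cache per layer and per loop: grows linearly with loop count.
per_loop_cache = {loops: kv_cache_bytes(num_loops=loops, **cfg)
                  for loops in (1, 2, 4, 8)}

if __name__ == "__main__":
    mib = 1024 ** 2
    for loops, size in per_loop_cache.items():
        print(f"loops={loops}: per-loop cache {size / mib:.0f} MiB, "
              f"shared cache {shared_cache / mib:.0f} MiB")
```

Under these assumptions the per-loop cache is exactly `num_loops` times the shared one, which is the linear growth with reasoning depth that the abstract describes.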
