You Do Not Fully Utilize Transformer's Representation Capacity
Gleb Gerasimov, Yaroslav Aksenov, Nikita Balagansky, Viacheslav Sinii, Daniil Gavrilov

TL;DR
This paper identifies representation collapse as a limitation of standard Transformers and proposes Layer-Integrated Memory (LIMe), a lightweight extension that enhances representation capacity, improves convergence, and boosts performance across various tasks.
Contribution
The paper introduces LIMe, a novel method that leverages existing key-value buffers and learns routing weights to integrate multi-layer representations without increasing hidden size.
Findings
LIMe achieves faster convergence and lower perplexity per FLOP.
LIMe improves accuracy on synthetic reasoning benchmarks.
LIMe maintains higher value-vector entropy and better token separability.
Abstract
In contrast to RNNs, which compress their history into a single hidden state, Transformers can attend to all past tokens directly. However, standard Transformers rely solely on the hidden state from the previous layer to represent the entire context. We show that this design choice induces representation collapse and degrades performance. To address this issue, we introduce Layer-Integrated Memory (LIMe), a lightweight extension that leverages existing key-value buffers and learns per-head, per-layer routing weights to integrate representations from all previous layers with negligible overhead. Through extensive experiments, including language modeling, synthetic reasoning benchmarks, and very deep architectures, LIMe consistently achieves faster convergence, lower perplexity per FLOP, and substantial accuracy improvements on synthetic tasks while preserving higher value-vector entropy…
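The routing idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `route_layer_memory` and `causal_attention`, the tensor layout, and the use of unconstrained routing weights are all assumptions made for illustration, since this summary does not give the paper's exact parameterisation or initialisation. The sketch mixes the key and value buffers cached by earlier layers with one weight per (layer, head) pair, then runs ordinary causal attention over the mixed buffers.

```python
import numpy as np


def route_layer_memory(buffers, router):
    """Mix cached per-layer buffers with per-head routing weights.

    buffers: (L, H, T, D) array, one key (or value) buffer per earlier layer.
    router:  (L, H) array of routing weights for the current layer
             (assumed unconstrained here; hypothetical parameterisation).
    returns: (H, T, D) mixed buffer used by the current layer's attention.
    """
    buffers = np.asarray(buffers, dtype=float)
    router = np.asarray(router, dtype=float)
    if buffers.ndim != 4 or buffers.shape[:2] != router.shape:
        raise ValueError("expected buffers (L, H, T, D) and router (L, H)")
    # Weighted sum over the layer axis, done independently for every head.
    return np.einsum("lh,lhtd->htd", router, buffers)


def causal_attention(q, k, v):
    """Plain scaled dot-product attention with a causal mask; inputs are (H, T, D)."""
    t = q.shape[1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    future = np.triu(np.ones((t, t), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


# Toy sizes (illustrative only): 3 earlier layers, 2 heads, 4 tokens, head dim 8.
rng = np.random.default_rng(0)
L, H, T, D = 3, 2, 4, 8
keys = rng.normal(size=(L, H, T, D))
values = rng.normal(size=(L, H, T, D))
query = rng.normal(size=(H, T, D))

# A one-hot router on the most recent layer recovers standard attention.
last_only = np.zeros((L, H))
last_only[-1] = 1.0
baseline = causal_attention(query, keys[-1], values[-1])
routed_last = causal_attention(
    query,
    route_layer_memory(keys, last_only),
    route_layer_memory(values, last_only),
)

# A learned router would instead spread weight across layers, per head.
router = rng.normal(size=(L, H))
routed = causal_attention(
    query,
    route_layer_memory(keys, router),
    route_layer_memory(values, router),
)
```

In this sketch the only new parameters are the `L * H` routing weights per layer, and the mixed buffers have the same shape as a single layer's buffers, which is consistent with the abstract's claim that the hidden size does not grow.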
Peer Reviews
Decision: Submitted to ICLR 2026
While the method is similar to existing methods such as DenseFormer, the correct placement of the weighted averages matters: in addition to superior performance in the experiments, it yields side benefits such as the ability to reuse the KV cache. The authors also report additional investigative results, such as an analysis of the learned router weights.
In Section 5.1, it would be very helpful to have the random baseline for each task. In particular, the results reported for several of the tasks seem near chance (e.g., WiC). No confidence intervals are reported either, which makes it very hard to determine the significance of the improvements. Overall, this makes me question the efficacy of the method in general language modeling. It is confusing to refer to LLaMA in Table 1. Based on my understanding, this is only a model with the same
Comprehensive experimental design: The evaluation spans multiple dimensions: language modeling perplexity, mathematical reasoning on GSM8K, and synthetic tasks with controlled difficulty levels. The representation collapse analysis combines entropy measurements, linear separability tests, and grammatical probing to validate the core hypothesis from different angles. The routing weight analysis provides interpretability by revealing which layer representations the model prefers to access. This is
1. Limited novelty over prior work. The core mechanism of using learned weights to aggregate multi-layer representations appears in Transparent Attention (Bapna et al., EMNLP 2018), which uses trainable softmax-normalized weights to combine encoder layer outputs in NMT decoder cross-attention. The mathematical formulation resembles that prior work, with the main difference being application to decoder-only self-attention. More recently, Hyper-Connections (Zhu et al., Sept 2024) addresses represe
I am a bit on the fence regarding this paper. In terms of strengths, I believe that the idea proposed in the paper is simple and elegant. The paper is clearly written and easy to follow. The experimental evaluations are convincing.
My main concern with the paper is its relation to previous work, and especially its significance with respect to that work. First, I believe that the paper does not do a great job of discussing the differences from previous work such as DenseFormer or Value Residual Learning. More precisely, I think that the idea of combining the representations from multiple previous layers instead of just using the representation from the previous layer is not new. The contributions of the paper are thus mostly ab
