Transformers need glasses! Information over-squashing in language tasks
Federico Barbero, Andrea Banino, Steven Kapturowski, Dharshan Kumaran, João G.M. Araújo, Alex Vitvitskyi, Razvan Pascanu, Petar Veličković

TL;DR
This paper analyzes how information over-squashing in decoder-only Transformers causes representational collapse, leading to model errors, and proposes solutions to improve information propagation in large language models.
Contribution
It provides a theoretical analysis of information over-squashing in Transformers, revealing representational collapse and its impact on model performance, supported by empirical evidence.
Findings
Representational collapse occurs in Transformer models.
Over-squashing leads to loss of token sensitivity.
Low-precision formats exacerbate information loss.
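The first and third findings can be illustrated with a toy sketch. This is not the paper's construction: it assumes, purely for illustration, that the last token attends uniformly over scalar token values, so its summary is just the mean. The helper names (`to_float16`, `last_token_summary`, `collapse_gap`) are hypothetical, and IEEE half precision stands in for "low-precision formats" only because the Python standard library can round to it.

```python
import struct


def to_float16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def last_token_summary(tokens, low_precision=False):
    """Mean of the token values.

    Assumption: uniform attention from the last token, so the summary
    it receives is a plain average of everything in the sequence.
    """
    mean = sum(tokens) / len(tokens)
    return to_float16(mean) if low_precision else mean


def collapse_gap(n, low_precision=False):
    """Distance between the summaries of two distinct sequences:
    n ones, and n ones followed by a single zero."""
    seq_a = [1.0] * n
    seq_b = [1.0] * n + [0.0]
    return abs(
        last_token_summary(seq_a, low_precision)
        - last_token_summary(seq_b, low_precision)
    )


if __name__ == "__main__":
    # The gap shrinks roughly like 1 / (n + 1) as the sequences grow.
    for n in (10, 100, 1000, 5000):
        print(n, collapse_gap(n), collapse_gap(n, low_precision=True))
```

In this toy setting the two sequences always differ by one token, yet the gap between their summaries decays as the sequence grows, and after rounding to half precision it can vanish entirely, at which point nothing downstream could tell the two inputs apart.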
Abstract
We study how information propagates in decoder-only Transformers, which are the architectural backbone of most existing frontier large language models (LLMs). We rely on a theoretical signal propagation analysis -- specifically, we analyse the representations of the last token in the final layer of the Transformer, as this is the representation used for next-token prediction. Our analysis reveals a representational collapse phenomenon: we prove that certain distinct sequences of inputs to the Transformer can yield arbitrarily close representations in the final token. This effect is exacerbated by the low-precision floating-point formats frequently used in modern LLMs. As a result, the model is provably unable to respond to these sequences in different ways -- leading to errors in, e.g., tasks involving counting or copying. Further, we show that decoder-only Transformer language models…
Taxonomy
Topics: Speech and dialogue systems · Natural Language Processing Techniques
Methods: Attention Is All You Need · Softmax · Layer Normalization · Linear Layer · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Label Smoothing · Adam · Residual Connection · Multi-Head Attention
