Transformers need glasses! Information over-squashing in language tasks

Federico Barbero; Andrea Banino; Steven Kapturowski; Dharshan Kumaran; João G.M. Araújo; Alex Vitvitskyi; Razvan Pascanu; Petar Veličković

arXiv:2406.04267 · cs.CL · October 28, 2024 · 1 citation

TL;DR

This paper analyzes how information over-squashing in decoder-only Transformers causes representational collapse, leading to model errors, and proposes solutions to improve information propagation in large language models.

Contribution

It provides a theoretical analysis of information over-squashing in Transformers, revealing representational collapse and its impact on model performance, supported by empirical evidence.

Findings

01. Representational collapse occurs in Transformer models.
02. Over-squashing leads to loss of token sensitivity.
03. Low-precision formats exacerbate information loss.
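
The low-precision finding can be made concrete with a minimal Python sketch, which is an illustration and not code from the paper. It uses IEEE half precision (the standard library's `struct` format `"e"`) as a stand-in for the low-precision formats in question. The helper names `to_half` and `first_half_collision`, and the choice of the fractions n/(n+1) as test values, are assumptions made for this example.

```python
import struct


def to_half(x: float) -> float:
    """Round a Python float to IEEE half precision and back (illustrative helper)."""
    return struct.unpack("e", struct.pack("e", x))[0]


def first_half_collision(max_n: int = 10_000):
    """Return the smallest n for which n/(n+1) and (n+1)/(n+2), which are
    distinct in double precision, round to the same half-precision value.
    Returns None if no such n is found up to max_n."""
    for n in range(1, max_n + 1):
        a = n / (n + 1)
        b = (n + 1) / (n + 2)
        if a != b and to_half(a) == to_half(b):
            return n
    return None


if __name__ == "__main__":
    n = first_half_collision()
    print("first collision at n =", n)
    print("half-precision values:", to_half(n / (n + 1)), to_half((n + 1) / (n + 2)))
```

Once two values collide like this, no later computation in the same precision can tell them apart, which is the sense in which reduced precision makes information loss worse.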

Abstract

We study how information propagates in decoder-only Transformers, which are the architectural backbone of most existing frontier large language models (LLMs). We rely on a theoretical signal propagation analysis -- specifically, we analyse the representations of the last token in the final layer of the Transformer, as this is the representation used for next-token prediction. Our analysis reveals a representational collapse phenomenon: we prove that certain distinct sequences of inputs to the Transformer can yield arbitrarily close representations in the final token. This effect is exacerbated by the low-precision floating-point formats frequently used in modern LLMs. As a result, the model is provably unable to respond to these sequences in different ways -- leading to errors in, e.g., tasks involving counting or copying. Further, we show that decoder-only Transformer language models…
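
The collapse described in the abstract can be illustrated with a toy calculation. The sketch below is not the paper's construction: it assumes a single attention head with scalar values and equal attention logits, so the softmax weights are uniform, and the names `softmax`, `last_token_output`, and `collapse_gap` are hypothetical. Under those assumptions it compares the last-token output for two distinct sequences: n ones followed by a zero, and n+1 ones followed by a zero.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def last_token_output(values, logits):
    """Attention output at the last position: a softmax-weighted sum of values."""
    weights = softmax(logits)
    return sum(w * v for w, v in zip(weights, values))


def collapse_gap(n: int) -> float:
    """Absolute difference between the last-token outputs of two distinct
    sequences, assuming equal logits (uniform attention) for both."""
    seq_a = [1.0] * n + [0.0]        # n ones, then a zero
    seq_b = [1.0] * (n + 1) + [0.0]  # n + 1 ones, then a zero
    out_a = last_token_output(seq_a, [0.0] * len(seq_a))
    out_b = last_token_output(seq_b, [0.0] * len(seq_b))
    return abs(out_a - out_b)


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, collapse_gap(n))
```

Under these assumptions the gap equals 1/((n+1)(n+2)) in exact arithmetic, so it shrinks as the sequences grow but never reaches zero. The two inputs stay distinct while their final-token representations become arbitrarily close.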

Datasets

kanak8278/small-llm-blind-spots
dataset · 55 downloads

Taxonomy

Topics: Speech and dialogue systems · Natural Language Processing Techniques

Methods: Attention Is All You Need · Softmax · Layer Normalization · Linear Layer · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Label Smoothing · Adam · Residual Connection · Multi-Head Attention