Continuity and Isolation Lead to Doubts or Dilemmas in Large Language Models
Hector Pasten, Felipe Urrutia, Hector Jimenez, Cristian B. Calderon, Cristóbal Rojas, Alexander Kozachinskiy

TL;DR
This paper reveals two fundamental phenomena, isolation and continuity, that limit Transformers' ability to learn simple patterns, supported by mathematical proofs and practical experiments.
Contribution
It introduces the phenomena of isolation and continuity in Transformers with compact positional encoding, providing mathematical proofs and empirical evidence of their limitations.
Findings
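Isolation: any learnable sequence must be isolated from every other learnable sequence, so some sets of sequences cannot be learned by a single Transformer at the same time.
Continuity: an attractor basin forms around each learned sequence, and any sequence falling in that basin collapses toward the learned sequence.
The theoretical limitations are also observed experimentally in practical Transformer models.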
Abstract
Understanding how Transformers work and how they process information is key to the theoretical and empirical advancement of these machines. In this work, we demonstrate the existence of two phenomena in Transformers, namely isolation and continuity. Both phenomena hinder Transformers from learning even simple pattern sequences. Isolation expresses that any learnable sequence must be isolated from any other learnable sequence, and hence some sequences cannot be learned by a single Transformer at the same time. Continuity entails that an attractor basin forms around a learned sequence, such that any sequence falling in that basin will collapse towards the learned sequence. Here, we mathematically prove that these phenomena emerge in all Transformers that use compact positional encoding, and design rigorous experiments, demonstrating that the theoretical limitations we shed light on occur on…
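The intuition behind continuity can be illustrated with a toy computation. The sketch below is not the paper's construction or experiment; it is a minimal illustration under stated assumptions: a single softmax attention head read out at the last position, one-hot embeddings over a two-token alphabet, identity query/key/value projections, and no positional encoding (a trivially bounded special case). The function names `last_token_attention` and `perturbation_gap` are hypothetical. It measures how far the attention output moves when the first `k` tokens of a constant sequence are flipped, for growing sequence length `n`.

```python
import numpy as np


def last_token_attention(tokens: np.ndarray) -> np.ndarray:
    """Single-head softmax attention, read out at the last position.

    Toy assumptions: one-hot embeddings for tokens {0, 1}, identity
    query/key/value projections, and no positional encoding.
    """
    emb = np.eye(2)                         # one-hot embeddings, shape (2, 2)
    x = emb[tokens]                         # embedded sequence, shape (n, 2)
    q = x[-1]                               # query comes from the last token
    scores = x @ q / np.sqrt(x.shape[1])    # scaled dot-product scores, shape (n,)
    weights = np.exp(scores - scores.max()) # numerically stable softmax
    weights /= weights.sum()
    return weights @ x                      # attention output, shape (2,)


def perturbation_gap(n: int, k: int = 1) -> float:
    """Distance between attention outputs on 00...0 and on the same
    sequence with its first k tokens flipped to 1 (requires k < n)."""
    base = np.zeros(n, dtype=int)
    perturbed = base.copy()
    perturbed[:k] = 1
    diff = last_token_attention(base) - last_token_attention(perturbed)
    return float(np.linalg.norm(diff))


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, perturbation_gap(n))
```

In this toy setting the gap shrinks as `n` grows, because each of the `k` flipped tokens receives a softmax weight on the order of `1/n`. This is one way to picture why a sequence that differs from a learned one in only a few positions can end up being treated almost identically; the paper's actual statement and proof cover all Transformers with compact positional encoding and are more general than this example.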
Taxonomy
Topics: Machine Learning and Algorithms · Domain Adaptation and Few-Shot Learning · Language and cultural evolution
Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Dense Connections · Layer Normalization · Byte Pair Encoding · Label Smoothing · Adam · Softmax · Position-Wise Feed-Forward Layer
