Continuity and Isolation Lead to Doubts or Dilemmas in Large Language Models

Hector Pasten; Felipe Urrutia; Hector Jimenez; Cristian B. Calderon; Cristóbal Rojas; Alexander Kozachinskiy

arXiv:2505.10606·cs.LG·May 19, 2025

TL;DR

This paper reveals two fundamental phenomena, isolation and continuity, that limit Transformers' ability to learn simple patterns, supported by mathematical proofs and practical experiments.

Contribution

The paper identifies isolation and continuity in Transformers with compact positional encoding, and gives mathematical proofs and empirical evidence that these phenomena limit what such models can learn.

Findings

1. Isolation prevents a single Transformer from learning certain sequences at the same time.

2. Continuity causes nearby sequences to collapse into the attractor basin of a learned sequence.

3. These limitations also occur in practical Transformer models.

Abstract

Understanding how Transformers work and how they process information is key to the theoretical and empirical advancement of these machines. In this work, we demonstrate the existence of two phenomena in Transformers, namely isolation and continuity. Both of these phenomena hinder Transformers from learning even simple pattern sequences. Isolation expresses that any learnable sequence must be isolated from any other learnable sequence, and hence some sequences cannot be learned by a single Transformer at the same time. Continuity entails that an attractor basin forms around a learned sequence, such that any sequence falling in that basin will collapse towards the learned sequence. Here, we mathematically prove that these phenomena emerge in all Transformers that use compact positional encoding, and design rigorous experiments, demonstrating that the theoretical limitations we shed light on occur on…
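The continuity idea in the abstract can be illustrated with a toy computation. The sketch below is not the paper's construction, proof, or experimental setup. It assumes a single softmax-attention readout with identity query, key, and value projections, one-hot token embeddings, and a positional encoding on the unit circle (a bounded, hence compact, set). All function names are hypothetical. It measures how far the readout at the last position moves when one early token is flipped, for a short and a long input.

```python
import math

def embed(token, pos):
    # Assumption: one-hot token embedding (2 dims) concatenated with a
    # positional encoding on the unit circle, so every coordinate is bounded.
    onehot = [1.0, 0.0] if token == 0 else [0.0, 1.0]
    return onehot + [math.sin(pos), math.cos(pos)]

def attention_readout(tokens):
    """Softmax-attention output at the last position (identity Q/K/V projections)."""
    xs = [embed(t, i) for i, t in enumerate(tokens)]
    query = xs[-1]
    scale = math.sqrt(len(query))
    logits = [sum(q * k for q, k in zip(query, x)) / scale for x in xs]
    m = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in logits]
    total = sum(weights)
    dim = len(query)
    return [sum(w * x[j] for w, x in zip(weights, xs)) / total for j in range(dim)]

def flip_effect(n):
    """Distance between readouts for an all-zero input and the same input
    with only its first token flipped to 1."""
    base = [0] * n
    flipped = [1] + [0] * (n - 1)
    return math.dist(attention_readout(base), attention_readout(flipped))

short_gap = flip_effect(8)
long_gap = flip_effect(4096)
print("gap for n=8:   ", short_gap)
print("gap for n=4096:", long_gap)
```

Because the logits here are bounded, no single position can hold more than a share of attention proportional to 1/n, so the effect of one flipped token shrinks as the input grows. This is only an intuition for why a model of this kind might give nearly the same output on sequences that differ in few places; the paper's own definitions and proofs should be consulted for the precise statement.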

Taxonomy

Topics: Machine Learning and Algorithms · Domain Adaptation and Few-Shot Learning · Language and cultural evolution

Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Dense Connections · Layer Normalization · Byte Pair Encoding · Label Smoothing · Adam · Softmax · Position-Wise Feed-Forward Layer