Demystifying Singular Defects in Large Language Models
Haoqi Wang, Tong Zhang, Mathieu Salzmann

TL;DR
This paper investigates the causes of high-norm tokens in large language models, providing theoretical insights and empirical validation, and proposes practical applications for model improvement and analysis.
Contribution
It introduces a new analysis framework for understanding singular defects in LLMs, revealing key properties and mechanisms behind high-norm tokens.
Findings
Layer-wise singular directions predict token norm explosions.
Negative eigenvalues of a layer explain the sudden decay of token norms.
Different pathways lead to high-norm tokens for initial and noninitial tokens.
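The first finding rests on a standard linear-algebra fact: a linear map amplifies an input most when that input is aligned with the map's leading right singular vector. The sketch below illustrates this on a toy matrix. It is not the authors' code; the matrix `A`, its size, and the rank-one "defect" strength are hypothetical stand-ins for the linear approximation of a layer.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # hypothetical hidden size

# Hypothetical stand-in for a layer's linear approximation:
# a weak random background plus one strong rank-one component.
u = rng.standard_normal(d)
u /= np.linalg.norm(u)
v = rng.standard_normal(d)
v /= np.linalg.norm(v)
A = 0.05 * rng.standard_normal((d, d)) + 50.0 * np.outer(u, v)

# Singular value decomposition: A = U @ diag(S) @ Vt.
U, S, Vt = np.linalg.svd(A)
v1 = Vt[0]  # leading right singular vector

# A unit-norm token aligned with v1, and one orthogonal to it.
aligned = v1
orth = rng.standard_normal(d)
orth -= (orth @ v1) * v1
orth /= np.linalg.norm(orth)

norm_aligned = np.linalg.norm(A @ aligned)  # equals the top singular value
norm_orth = np.linalg.norm(A @ orth)        # bounded by the second singular value

print(f"top singular value : {S[0]:.2f}")
print(f"aligned token norm : {norm_aligned:.2f}")
print(f"orthogonal token   : {norm_orth:.2f}")
```

In this toy setting, only the token aligned with `v1` leaves the map with a large norm, which is the sense in which a singular direction can predict where a norm explosion occurs.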
Abstract
Large transformer models are known to produce high-norm tokens. In vision transformers (ViTs), such tokens have been mathematically modeled through the singular vectors of the linear approximations of layers. However, in large language models (LLMs), the underlying causes of high-norm tokens remain largely unexplored, and their different properties from those of ViTs require a new analysis framework. In this paper, we provide both theoretical insights and empirical validation across a range of recent models, leading to the following observations: i) The layer-wise singular direction predicts the abrupt explosion of token norms in LLMs. ii) The negative eigenvalues of a layer explain its sudden decay. iii) The computational pathways leading to high-norm tokens differ between initial and noninitial tokens. iv) High-norm tokens are triggered by the right leading singular vector of the…
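Observation ii) can be illustrated with a toy example. The sketch below assumes a residual update of the form y = x + A x, where the hypothetical matrix `A` has eigenvalue -1 along the direction carrying the large norm; this form is an assumption made for illustration and is not taken from the paper. Under that assumption, the update cancels the high-norm component and the token norm drops abruptly.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # hypothetical hidden size

# Direction along which the token carries its large norm.
w = rng.standard_normal(d)
w /= np.linalg.norm(w)

# Hypothetical residual branch: eigenvalue -1 along w, 0 elsewhere.
A = -1.0 * np.outer(w, w)

# A high-norm token: large component along w plus small noise.
noise = 0.1 * rng.standard_normal(d)
x = 100.0 * w + noise

# Assumed residual update (illustrative, not the paper's formulation).
y = x + A @ x

eigvals = np.linalg.eigvalsh(A)  # A is symmetric, so eigvalsh applies
norm_before = np.linalg.norm(x)
norm_after = np.linalg.norm(y)

print(f"smallest eigenvalue: {eigvals.min():.2f}")
print(f"norm before: {norm_before:.2f}, norm after: {norm_after:.2f}")
```

An eigenvalue between -1 and 0 would shrink the component along `w` only partially, so how sharp the decay is depends on how close the eigenvalue is to -1 in this simplified picture.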
Taxonomy
Topics: Natural Language Processing Techniques
