Geometry of Semantics in Next-Token Prediction: How Optimization Implicitly Organizes Linguistic Representations
Yize Zhao, Christos Thrampoulidis

TL;DR
This paper reveals how next-token prediction in language models implicitly organizes semantic information through matrix factorization, leading to emergent semantic hierarchies and interpretable categories.
Contribution
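It introduces a mathematical framework showing that next-token prediction (NTP) optimization implicitly guides models to factor a centered support matrix of context-to-next-token co-occurrence patterns via SVD, so that semantic structure emerges without ever being explicitly encoded.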
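To make the factorization concrete, here is a minimal NumPy sketch. It is not the authors' code: the toy matrix `S`, the helper name `centered_support_svd`, and the choice to center each context column over the vocabulary are all assumptions made for illustration.

```python
import numpy as np

# Toy support matrix (hypothetical data): rows are vocabulary tokens,
# columns are contexts; an entry of 1 means the token can follow the context.
S = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
], dtype=float)


def centered_support_svd(support):
    """Center a support matrix and return its thin SVD.

    Assumption: centering subtracts each column's mean over the vocabulary,
    so every context column of the centered matrix sums to zero.
    """
    support = np.asarray(support, dtype=float)
    centered = support - support.mean(axis=0, keepdims=True)
    # Thin SVD: centered = (U * s) @ Vt, singular values in descending order.
    U, s, Vt = np.linalg.svd(centered, full_matrices=False)
    return centered, U, s, Vt


centered, U, s, Vt = centered_support_svd(S)
print(np.round(s, 3))  # singular values, largest first
```

In this sketch the rows of `U` play the role of word-side factors and the rows of `Vt.T` the context-side factors. The signs of singular vectors returned by an SVD routine are arbitrary, so only the pattern of agreement and disagreement between entries is meaningful.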
Findings
Concepts tied to larger singular values are learned earlier in training.
Models recover diverse semantic categories, such as entities and topics.
The singular value hierarchy reflects semantic granularity: broad categories emerge before fine-grained ones.
Abstract
We investigate how next-token prediction (NTP) optimization leads language models to extract and organize semantic structure from text. Our analysis, based on a tractable mathematical model and controlled synthetic data, reveals that NTP implicitly guides models to factor a centered support matrix encoding context-to-next-token co-occurrence patterns via singular value decomposition (SVD). While models never explicitly construct this matrix, learned word and context embeddings converge to its SVD factors, with singular vectors encoding latent semantic concepts through their sign patterns. We demonstrate that concepts corresponding to larger singular values are learned earlier during training, yielding a natural semantic hierarchy where broad categories emerge before fine-grained ones. This insight motivates orthant-based clustering, a method that combines concept signs to identify…
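The sign-pattern idea in the abstract can be sketched as follows. This is an illustrative reading, not the paper's procedure: the function name `orthant_clusters`, the tolerance `tol`, the toy data, and the choice to cluster tokens by the signs of the top-k left singular vectors are all assumptions.

```python
import numpy as np
from collections import defaultdict


def orthant_clusters(support, k, tol=1e-9):
    """Group tokens by the sign pattern of their top-k concept coordinates.

    Assumptions: rows of `support` are tokens, columns are contexts, and the
    matrix is centered per context column before the SVD. Entries with
    magnitude below `tol` are treated as 0 (the token is neutral on that
    concept).
    """
    support = np.asarray(support, dtype=float)
    centered = support - support.mean(axis=0, keepdims=True)
    U, s, Vt = np.linalg.svd(centered, full_matrices=False)
    coords = U[:, :k]
    # Map each coordinate to -1, 0, or +1.
    signs = np.where(np.abs(coords) < tol, 0, np.sign(coords)).astype(int)
    clusters = defaultdict(list)
    for token, pattern in enumerate(signs):
        clusters[tuple(int(x) for x in pattern)].append(token)
    return dict(clusters)


# Hypothetical toy data: tokens 0-1 share contexts 0-1, tokens 2-4 share contexts 2-3.
S = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
])

coarse = orthant_clusters(S, k=1)  # one concept: broad split
finer = orthant_clusters(S, k=2)   # two concepts: finer split
print(sorted(coarse.values()), sorted(finer.values()))
```

Using more concepts (larger `k`) can only split existing groups further, which mirrors the coarse-to-fine hierarchy described above. Because the sign of each singular vector is arbitrary, the cluster labels may flip between runs or libraries, but the grouping of tokens does not.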
Taxonomy
Topics: Natural Language Processing Techniques · Semantic Web and Ontologies
Methods: Spectral Clustering
