Latent Semantic Manifolds in Large Language Models
Mohamed A. Mabrok

TL;DR
This paper introduces a geometric framework for understanding how large language models encode semantics in continuous spaces, revealing fundamental limits and properties of their internal representations.
Contribution
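It develops a mathematical model that interprets LLM hidden states as points on a Riemannian manifold equipped with the Fisher information metric, and proves two theorems: a rate-distortion lower bound on semantic distortion for any finite vocabulary, and a linear volume scaling law for the expressibility gap.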
Findings
Universal hourglass intrinsic dimension profiles across models
Linear scaling law for the semantic expressibility gap
Persistent boundary-proximal representations invariant to scale
Abstract
Large Language Models (LLMs) perform internal computations in continuous vector spaces yet produce discrete tokens -- a fundamental mismatch whose geometric consequences remain poorly understood. We develop a mathematical framework that interprets LLM hidden states as points on a latent semantic manifold: a Riemannian submanifold equipped with the Fisher information metric, where tokens correspond to Voronoi regions partitioning the manifold. We define the expressibility gap, a geometric measure of the semantic distortion from vocabulary discretization, and prove two theorems: a rate-distortion lower bound on distortion for any finite vocabulary, and a linear volume scaling law for the expressibility gap via the coarea formula. We validate these predictions across six transformer architectures (124M-1.5B parameters), confirming universal hourglass intrinsic dimension profiles, smooth…
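To make the Fisher information metric mentioned in the abstract concrete, here is a minimal sketch under a stated assumption: the next-token distribution is p = softmax(W h) for an unembedding matrix W and hidden state h. Under that assumption the Fisher metric pulled back to hidden-state space is G(h) = Wᵀ(diag(p) − p pᵀ)W. The names `W`, `h`, and `fisher_metric`, the toy sizes, and the random data are all hypothetical; the paper's exact construction is not given in this summary and may differ.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array of logits."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def fisher_metric(W, h):
    """Fisher metric at hidden state h, ASSUMING p = softmax(W @ h).

    The Fisher information of a categorical distribution with respect to
    its logits is diag(p) - p p^T; pulling it back through the linear map
    h -> W @ h gives W^T (diag(p) - p p^T) W.
    """
    p = softmax(W @ h)
    fisher_logits = np.diag(p) - np.outer(p, p)
    return W.T @ fisher_logits @ W

# Toy sizes (hypothetical): 50 tokens, 8-dimensional hidden states.
rng = np.random.default_rng(0)
W = rng.normal(size=(50, 8))
h = rng.normal(size=8)

G = fisher_metric(W, h)

# Squared length of a small step dh under the metric: dh^T G dh.
dh = 1e-2 * rng.normal(size=8)
step_length_sq = float(dh @ G @ dh)

# Illustrative stand-in for a token region (an assumption, not the
# paper's definition): the token with the highest logit at h.
token_id = int(np.argmax(W @ h))
```

The resulting `G` is symmetric and positive semidefinite, so `step_length_sq` is non-negative: it measures how much the output distribution changes for a small move in hidden-state space, which is the sense in which the metric captures semantic distance.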
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Artificial Intelligence in Healthcare and Education
