The Origins of Representation Manifolds in Large Language Models
Alexander Modell, Patrick Rubin-Delanchy, Nick Whiteley

TL;DR
This paper explores how features are represented as manifolds in large language models, linking geometric properties in embedding space to conceptual relatedness, and validates the theory on text embeddings.
Contribution
It introduces a manifold-based model of feature representation in neural embeddings, extending beyond linear assumptions to capture continuous, multidimensional features.
Findings
Cosine similarity encodes feature geometry on manifolds
Manifold paths relate to concept relatedness
Theory validated on LLM text embeddings
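The first two findings can be illustrated with a toy example. The sketch below is not the paper's construction: `embed` is a hypothetical map, chosen only for illustration, that places a circular feature value (an angle, such as a hue) on the unit sphere in four dimensions. By design, the cosine similarity between two embedded points depends only on the difference between their feature values, and the length of the path traced along the embedded curve is proportional to that difference.

```python
import math

def embed(theta):
    """Hypothetical embedding of a circular feature value onto the unit sphere in R^4."""
    s = 1.0 / math.sqrt(2.0)
    return [s * math.cos(theta), s * math.sin(theta),
            s * math.cos(2 * theta), s * math.sin(2 * theta)]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def path_length(theta_start, theta_end, steps=2000):
    """Approximate the length of the embedded curve by summing short chords."""
    total = 0.0
    prev = embed(theta_start)
    for i in range(1, steps + 1):
        t = theta_start + (theta_end - theta_start) * i / steps
        cur = embed(t)
        total += math.dist(prev, cur)
        prev = cur
    return total

def length_ratios(pairs):
    """Ratio of path length in embedding space to distance in feature space."""
    return [path_length(a, b) / abs(b - a) for a, b in pairs]

# Illustrative feature-value pairs (assumed, not taken from the paper).
pairs = [(0.0, 0.5), (1.0, 2.0), (0.3, 3.0)]
ratios = length_ratios(pairs)
print(ratios)

# Two pairs with the same feature-space gap give the same cosine similarity.
print(cosine(embed(0.2), embed(0.7)), cosine(embed(1.2), embed(1.7)))
```

For this particular map the curve has constant speed sqrt(5/2), so every ratio comes out close to that value regardless of where on the circle the pair sits. Whether real LLM embeddings behave this way is the empirical question the paper addresses; the toy only shows what the claimed relationship looks like.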
Abstract
There is a large ongoing scientific effort in mechanistic interpretability to map embeddings and internal representations of AI systems into human-understandable concepts. A key element of this effort is the linear representation hypothesis, which posits that neural representations are sparse linear combinations of 'almost-orthogonal' direction vectors, reflecting the presence or absence of different features. This model underpins the use of sparse autoencoders to recover features from representations. Moving towards a fuller model of features, in which neural representations could encode not just the presence but also a potentially continuous and multidimensional value for a feature, has been a subject of intense recent discourse. We describe why and how a feature might be represented as a manifold, demonstrating in particular that cosine similarity in representation space may encode…
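The linear representation hypothesis mentioned in the abstract can be made concrete with a small sketch. Everything here is an assumption for illustration: the dimension, the number of features, the random directions, and the threshold readout, which stands in for a sparse autoencoder and is much simpler than one. The point is only that random unit vectors in a high-dimensional space are nearly orthogonal, so a sparse sum of a few of them can be decoded by inner products.

```python
import math
import random

def random_unit_vector(dim, rng):
    """Draw a random direction on the unit sphere in R^dim."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

rng = random.Random(0)          # fixed seed so the run is repeatable
dim, n_features = 512, 50       # assumed sizes, chosen for illustration
directions = [random_unit_vector(dim, rng) for _ in range(n_features)]

# "Almost-orthogonal": the largest overlap between distinct directions is small.
max_overlap = max(abs(dot(directions[i], directions[j]))
                  for i in range(n_features)
                  for j in range(i + 1, n_features))

# A representation in which only three features are present (sparse combination).
active = [3, 17, 42]
representation = [sum(directions[k][i] for k in active) for i in range(dim)]

# Read features back out by thresholding inner products with each direction.
recovered = [k for k in range(n_features)
             if dot(representation, directions[k]) > 0.5]

print(max_overlap)
print(recovered)
```

This readout only reports presence or absence. It has no way to express a continuous or multidimensional feature value, which is the limitation the abstract says the manifold view is meant to address.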
Peer Reviews
Decision·Submitted to ICLR 2026
- The paper has good mathematical rigor and the formalism introduced allows us to tackle this problem going beyond the standard LRH.
- The concepts are well-defined and the hypotheses considered are sensible. I liked that all mathematical statements and definitions were accompanied by intuitive explanations that are more digestible.
- The paper is well-written and tackles an important research area. Mechanistic interpretability is a very relevant problem and we need more theoretical work in thi
- The main weakness of the paper is the experimental design, which does not seem to support the theoretical findings with enough evidence. This is important, because the main results of the paper (which are Proposition 1 and Theorem 1) depend on strong hypotheses (Hypothesis 1 and Hypothesis 2, respectively). Hence, it is crucial that the experiments are able to support these hypotheses (and thus the theoretical results) with enough evidence.
- The experimental design is lacking in details in the
The paper looks to generalise the linear representation hypothesis (LRH). Hypotheses are presented that imply a linear relationship between path distances on feature manifolds in embedding space and distance in an assumed feature (metric) space. Empirical results for 3 features (colour/dates) show an apparent linear relationship, subject to feature parameterisation.
* The paper is titled "The origins ...", but no *origins* seem to be explained (or even discussed). Sufficient conditions are posited (Hyp 1 & 2) that imply path lengths in feature space are directly proportional to those in embedding space. Even if all true, this simply pushes back the question one level to why do those hypotheses hold. The existence of manifolds is well known, it is not clear that the paper explains "their origins".
* While the paper claims to give a simpler definition (Def 2)
The paper is written nicely with clear definitions to properly formalize terms being thrown around in interpretability ("feature") or to connect with prior work that tries to establish a framing for how we should characterize model representations. Though whether the field will adopt the authors' provided definitions/framing is to be seen, the authors provide compelling evidence that their provided framing (i.e., defining features as a metric space the continuous correspondence hypothesis) fits
To be honest I'm not sure if the title "The origins of representation manifolds in large language models" is a bit too broad of a title that doesn't reflect the findings/claims of the paper. In particular, I think the paper has a heavy focus on the use of a metric space, and in part studying the role that distance (in either metric space or embedding space) plays to support their definition of features as metric spaces. Albeit the paper being a fun read, its claims and evidence are mostly based
Taxonomy
Topics
Natural Language Processing Techniques
