The Origins of Representation Manifolds in Large Language Models

Alexander Modell; Patrick Rubin-Delanchy; Nick Whiteley

arXiv:2505.18235·cs.LG·May 27, 2025·2 cites

TL;DR

This paper explores how features are represented as manifolds in large language models, linking geometric properties in embedding space to conceptual relatedness, and validates the theory on text embeddings.

Contribution

It introduces a manifold-based model of feature representation in neural embeddings, extending beyond linear assumptions to capture continuous, multidimensional features.

Findings

01 Cosine similarity encodes feature geometry on manifolds
02 Manifold paths relate to concept relatedness
03 Theory validated on LLM text embeddings
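The first two findings can be made concrete with a toy construction. The sketch below is not the paper's experiment: the embedding map `embed`, the choice of a circular feature (for example a hue angle), and the helper names are all assumptions made for illustration. It embeds an angle `t` as a unit vector so that cosine similarity depends only on the feature distance, and path length along the embedded curve is proportional to that distance.

```python
import math

def embed(t):
    """Hypothetical embedding of a circular feature value t as a unit vector in R^4."""
    s = 1.0 / math.sqrt(2.0)
    return [s * math.cos(t), s * math.sin(t), s * math.cos(2 * t), s * math.sin(2 * t)]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def path_length(t0, t1, steps=2000):
    """Approximate length of the embedded curve between t0 and t1 by summing short chords."""
    total = 0.0
    prev = embed(t0)
    for i in range(1, steps + 1):
        cur = embed(t0 + (t1 - t0) * i / steps)
        total += math.dist(prev, cur)
        prev = cur
    return total

# Cosine similarity depends only on the feature distance |t - s| for this toy map.
sim_a = cosine(embed(0.3), embed(1.3))
sim_b = cosine(embed(2.0), embed(3.0))

# Path length along the curve divided by feature distance is (approximately) constant.
ratios = [path_length(0.0, d) / d for d in (0.5, 1.0, 2.0)]

print(sim_a, sim_b)
print(ratios)
```

In this construction the curve has constant speed, which is why the ratios agree. Whether real LLM embeddings behave this way is the empirical question the paper addresses.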

Abstract

There is a large ongoing scientific effort in mechanistic interpretability to map embeddings and internal representations of AI systems into human-understandable concepts. A key element of this effort is the linear representation hypothesis, which posits that neural representations are sparse linear combinations of 'almost-orthogonal' direction vectors, reflecting the presence or absence of different features. This model underpins the use of sparse autoencoders to recover features from representations. Moving towards a fuller model of features, in which neural representations could encode not just the presence but also a potentially continuous and multidimensional value for a feature, has been a subject of intense recent discourse. We describe why and how a feature might be represented as a manifold, demonstrating in particular that cosine similarity in representation space may encode…
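As a minimal sketch of the linear representation hypothesis described in the abstract, the snippet below draws random unit vectors in a high-dimensional space (which are nearly orthogonal with high probability), forms a representation as a sparse sum of a few of them, and reads the active features back with dot products. The dimension, feature count, active set, and the 0.5 threshold are illustrative assumptions, and this readout is not a sparse autoencoder.

```python
import math
import random

def random_unit_vector(dim, rng):
    """Draw a Gaussian vector and normalise it to unit length."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

rng = random.Random(0)          # fixed seed so the run is repeatable
dim, n_features = 512, 20       # illustrative sizes, not taken from the paper
directions = [random_unit_vector(dim, rng) for _ in range(n_features)]

# Largest |cosine| between distinct feature directions: small means "almost orthogonal".
max_abs_cos = max(
    abs(dot(directions[i], directions[j]))
    for i in range(n_features)
    for j in range(i + 1, n_features)
)

# A representation in which only three features are present (sparse linear combination).
active = {2, 7, 11}
representation = [sum(directions[k][i] for k in active) for i in range(dim)]

# Read features back: a large projection onto a direction signals that feature is present.
recovered = {k for k in range(n_features) if dot(representation, directions[k]) > 0.5}

print(max_abs_cos)
print(sorted(recovered))
```

This presence-or-absence readout is exactly what the abstract describes as limited: it says nothing about a continuous or multidimensional value for a feature.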

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 2

Strengths

- The paper has good mathematical rigor and the formalism introduced allows us to tackle this problem going beyond the standard LRH.
- The concepts are well-defined and the hypotheses considered are sensible. I liked that all mathematical statements and definitions were accompanied by intuitive explanations that are more digestible.
- The paper is well-written and tackles an important research area. Mechanistic interpretability is a very relevant problem and we need more theoretical work in thi

Weaknesses

The main weakness of the paper is the experimental design, which does not seem to support the theoretical findings with enough evidence. This is important, because the main results of the paper (which are Proposition 1 and Theorem 1) depend on strong hypotheses (Hypothesis 1 and Hypothesis 2, respectively). Hence, it is crucial that the experiments are able to support these hypotheses (and thus the theoretical results) with enough evidence.
- The experimental design is lacking in details in the

Reviewer 02 · Rating 4 · Confidence 4

Strengths

The paper looks to generalise the linear representation hypothesis (LRH). Hypotheses are presented that imply a linear relationship between path distances on feature manifolds in embedding space and distance in an assumed feature (metric) space. Empirical results for 3 features (colour/dates) show an apparent linear relationship, subject to feature parameterisation.

Weaknesses

* The paper is titled "The origins ...", but no *origins* seem to be explained (or even discussed). Sufficient conditions are posited (Hyp 1 & 2) that imply path lengths in feature space are directly proportional to those in embedding space. Even if all true, this simply pushes back the question one level to why do those hypotheses hold. The existence of manifolds is well known, it is not clear that the paper explains "their origins".
* While the paper claims to give a simpler definition (Def 2)

Reviewer 03 · Rating 6 · Confidence 4

Strengths

The paper is written nicely with clear definitions to properly formalize terms being thrown around in interpretability ("feature") or to connect with prior work that tries to establish a framing for how we should characterize model representations. Though whether the field will adopt the authors' provided definitions/framing is to be seen, the authors provide compelling evidence that their provided framing (i.e., defining features as a metric space the continuous correspondence hypothesis) fits

Weaknesses

To be honest I'm not sure if the title "The origins of representation manifolds in large language models" is a bit too broad of a title that doesn't reflect the findings/claims of the paper. In particular, I think the paper has a heavy focus on the use of a metric space, and in part studying the role that distance (in either metric space or embedding space) plays to support their definition of features as metric spaces. Albeit the paper being a fun read, its claims and evidence are mostly based

Code & Models

Repositories

alexandermodell/representation-manifolds · Official

Taxonomy

Topics: Natural Language Processing Techniques