TL;DR
This paper demonstrates that large language model embeddings encode semantic information in a low-dimensional structure similar to human semantic ratings, revealing entangled features and implications for model steering.
Contribution
The paper shows that LLM embedding matrices contain a low-dimensional semantic structure aligned with human ratings, offering insight into how these models represent meaning internally.
Findings
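Projections of words onto semantic directions defined by antonym pairs (e.g. kind - cruel) correlate highly with human ratings.
These projections effectively reduce to a 3-dimensional subspace within LLM embeddings, resembling the structure found in human survey responses.
Shifting tokens along one semantic direction causes off-target effects on geometrically aligned features, proportional to their cosine similarity.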
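The last finding can be illustrated with a small numerical sketch. This is not the authors' code: the embedding width, the random vectors, the shift size `alpha`, and the chosen cosine similarity `cos_uv` are all assumptions made for illustration. It shows that adding a multiple of one unit direction `u` to a vector changes its projection onto another unit direction `v` by `alpha` times the cosine similarity of `u` and `v`.

```python
import numpy as np

def unit(vec):
    """Return vec scaled to unit length."""
    return vec / np.linalg.norm(vec)

rng = np.random.default_rng(0)
dim = 64  # hypothetical embedding width, not taken from the paper

# Hypothetical target direction (in the paper these come from antonym pairs).
u = unit(rng.normal(size=dim))

# Build a second unit direction v with a chosen cosine similarity to u.
noise = rng.normal(size=dim)
noise = unit(noise - (noise @ u) * u)  # component orthogonal to u
cos_uv = 0.6                           # assumed similarity, for illustration
v = cos_uv * u + np.sqrt(1.0 - cos_uv**2) * noise

x = rng.normal(size=dim)  # stand-in for a token embedding
alpha = 2.0               # assumed shift size
x_shifted = x + alpha * u

# Change in projection along the intended direction and along the aligned one.
on_target = (x_shifted - x) @ u
off_target = (x_shifted - x) @ v

print(f"on-target change:  {on_target:.3f}")
print(f"off-target change: {off_target:.3f}")
```

Because the shift is linear, the off-target change depends only on `alpha` and the angle between the two directions, not on the token being shifted.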
Abstract
Psychological research consistently finds that human ratings of words across diverse semantic scales can be reduced to a low-dimensional form with relatively little information loss. We find that the semantic associations encoded in the embedding matrices of large language models (LLMs) exhibit a similar structure. We show that the projections of words on semantic directions defined by antonym pairs (e.g. kind - cruel) correlate highly with human ratings, and further find that these projections effectively reduce to a 3-dimensional subspace within LLM embeddings, closely resembling the patterns derived from human survey responses. Moreover, we find that shifting tokens along one semantic direction causes off-target effects on geometrically aligned features proportional to their cosine similarity. These findings suggest that semantic features are entangled within LLMs similarly to how…
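The pipeline the abstract describes (antonym-pair directions, word projections, reduction to a few dimensions) can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the embedding matrix is random toy data, the word list and every antonym pair other than kind - cruel are hypothetical, the cosine-style projection is one plausible choice, and PCA via SVD stands in for whatever dimensionality reduction the paper uses. With random embeddings no low-dimensional structure should be expected; the sketch only shows the mechanics.

```python
import numpy as np

def antonym_direction(emb, vocab, pos, neg):
    """Unit vector pointing from the `neg` word embedding to the `pos` one."""
    diff = emb[vocab[pos]] - emb[vocab[neg]]
    return diff / np.linalg.norm(diff)

def project_words(emb, directions):
    """Cosine-style projection of every word onto every unit direction."""
    normed = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    return normed @ directions.T  # shape: (n_words, n_scales)

def explained_variance_ratio(scores):
    """Share of variance per principal component, computed via SVD."""
    centered = scores - scores.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variance = singular_values**2
    return variance / variance.sum()

# Toy stand-in for an LLM embedding matrix (random, purely illustrative).
rng = np.random.default_rng(0)
words = ["kind", "cruel", "strong", "weak", "fast", "slow",
         "good", "bad", "dog", "storm"]
vocab = {word: idx for idx, word in enumerate(words)}
emb = rng.normal(size=(len(words), 32))

# Only kind - cruel appears in the abstract; the other pairs are hypothetical.
pairs = [("kind", "cruel"), ("strong", "weak"), ("fast", "slow"), ("good", "bad")]
directions = np.stack([antonym_direction(emb, vocab, p, n) for p, n in pairs])

scores = project_words(emb, directions)
ratios = explained_variance_ratio(scores)
top3 = ratios[:3].sum()

print("projection matrix shape:", scores.shape)
print("variance share of first 3 components:", round(float(top3), 3))
```

To compare against human data, each column of `scores` would be correlated with the corresponding human rating scale; those ratings are not reproduced here.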
Peer Reviews
Decision·Submitted to ICLR 2026
- It is important to study how LLMs understand semantics and how their internal (low-dimensional) representations work. Using psychology studies and well-established theories on human semantics is a valuable way of doing so, and these kinds of interdisciplinary approaches seem particularly relevant.
- The different kinds of experiments nicely build on each other and feel like a natural progression. First they study the correlation between the human ratings and token embeddings, then they perform
I have three main criticisms of the framework and experimental set-up of the paper.
- First, the paper focuses only on the LLM static embedding matrices, without taking the activations into account. The authors justify this on page 2, saying that “embedding and unembedding matrices also warrant attention”, but ignoring the context and the actual transformer architecture of the LLM seems too strong a simplification. Particularly because all of the experiments are run on LLMs, the c
The paper is highly readable and the topic of understanding how LLMs represent meaning, particularly in relation to human cognitive models (Evaluation, Potency, Activity), is of broad interest.
The paper is insufficiently substantive for a top-tier machine learning conference. It offers no new theoretical insight into how semantics are encoded but provides additional empirical evidence for an existing, well-known principle that the geometric arrangement of embeddings (angles and vector addition) encodes semantic meaning. The central finding that feature "entanglement" leads to predictable off-target effects is fundamentally the expected mathematical consequence of performing linear ma
The paper is well written and clearly structured. The experimental design is solid and supports the authors’ claims effectively. The findings are interesting and open promising research directions at the intersection of cognitive science and large language models.
The visualizations could be clearer. For example, the correlation matrices presented in the appendix are difficult to interpret and compare in their current form. Moreover, the paper’s practical impact appears limited. While the results are conceptually interesting, it remains unclear how these insights could be applied to improve LLM architectures or their downstream performance.
