Not All Language Model Features Are One-Dimensionally Linear
Joshua Engels, Eric J. Michaud, Isaac Liao, Wes Gurnee, Max Tegmark

TL;DR
This paper investigates whether some language model representations are inherently multi-dimensional, develops a method to identify such features, and demonstrates their interpretability and computational relevance in GPT-2 and Mistral 7B.
Contribution
It introduces a scalable autoencoder-based approach to discover multi-dimensional features in language models, revealing interpretable circular features and their role in computation.
Findings
Identified interpretable circular features representing days and months.
Demonstrated these features are used in modular arithmetic tasks.
Provided evidence that multi-dimensional features are fundamental to certain model behaviors.
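To make the link between a circular feature and modular arithmetic concrete, here is a minimal toy sketch. It is not the paper's code and does not touch any model activations: it assumes an idealized two-dimensional feature in which the seven days sit evenly spaced on a unit circle, and the helper names (`embed`, `rotate`, `decode`, `add_days`) are hypothetical. Under that assumption, "k days after day d" becomes a rotation of the point for d.

```python
import math

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]


def embed(index, period):
    """Place item `index` of a cycle of length `period` on the unit circle."""
    angle = 2 * math.pi * index / period
    return (math.cos(angle), math.sin(angle))


def rotate(point, steps, period):
    """Rotate a 2-D point by `steps` positions of a `period`-step cycle."""
    angle = 2 * math.pi * steps / period
    x, y = point
    c, s = math.cos(angle), math.sin(angle)
    return (x * c - y * s, x * s + y * c)


def decode(point, period):
    """Read the nearest cycle position back off the circle."""
    angle = math.atan2(point[1], point[0])
    return round(angle / (2 * math.pi) * period) % period


def add_days(day, offset):
    """Toy modular addition: rotate the day's point, then decode it."""
    period = len(DAYS)
    start = embed(DAYS.index(day), period)
    return DAYS[decode(rotate(start, offset, period), period)]


print(add_days("Monday", 2))
print(add_days("Saturday", 3))
```

The wrap-around from Sunday back to Monday comes for free from the geometry, which is one reason a circle is a natural shape for a feature that tracks a cyclic quantity. A single direction in activation space has no such wrap-around.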
Abstract
Recent work has proposed that language models perform computation by manipulating one-dimensional representations of concepts ("features") in activation space. In contrast, we explore whether some language model representations may be inherently multi-dimensional. We begin by developing a rigorous definition of irreducible multi-dimensional features based on whether they can be decomposed into either independent or non-co-occurring lower-dimensional features. Motivated by these definitions, we design a scalable method that uses sparse autoencoders to automatically find multi-dimensional features in GPT-2 and Mistral 7B. These auto-discovered features include strikingly interpretable examples, e.g. circular features representing days of the week and months of the year. We identify tasks where these exact circles are used to solve computational problems involving modular arithmetic in…
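One way to picture the discovery step described in the abstract is to group sparse-autoencoder dictionary directions that point in similar directions and treat each group as a candidate multi-dimensional feature. The sketch below is a toy illustration under stated assumptions, not the authors' implementation: the use of cosine similarity, the union-find grouping, the function names, and the hand-written vectors are all hypothetical choices made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cluster_directions(directions, threshold):
    """Group directions whose pairwise cosine similarity exceeds `threshold`.

    Groups are the connected components of the similarity graph,
    computed with a small union-find structure.
    """
    n = len(directions)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if cosine(directions[i], directions[j]) > threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical stand-ins for dictionary directions (assumption, not real data).
toy_directions = [
    (1.0, 0.0, 0.0),
    (0.9, 0.3, 0.0),
    (0.7, 0.7, 0.0),
    (0.0, 0.0, 1.0),
    (0.0, 0.1, 1.0),
]

print(cluster_directions(toy_directions, threshold=0.8))
print(cluster_directions(toy_directions, threshold=0.99))
```

Running the toy example at the two thresholds shows how the grouping changes with the cutoff: a looser threshold merges more directions into one candidate feature, and a stricter one splits them apart.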
Peer Reviews
Decision: ICLR 2025 Poster
- The paper tackles a timely and important question in mechanistic interpretability by formalizing (inherently) multi-dimensional features in LLMs.
- The visual examples of circular representations are clear and compelling.
- I think the high-level idea of formalizing multi-dimensional, irreducible features is sensible and useful.
- Overall, the experiments are comprehensive with a variety of interesting results.
- I think there are some conceptual questions about the definitions of features and irreducibility.
- Conceptually, I think the definition of irreducibility is somewhat incomplete. First, there can be concepts that are correlated but can still be disentangled for separate interventions, so equating separability with independence can be limiting. Second, I can't tell for sure if the definition of reducibility is exhaustive. (Can there be a third category?)
- I have asked about some of
1) The updated proposed definition for the superposition hypothesis is sound and may lead to the discovery of other interesting structures such as the ones presented in the paper. Even if this direction does not lead to further scientific discoveries, the discovered representations themselves are an interesting finding.
2) In the paper, features are extracted from state-of-the-art LLMs, showcasing the existence of actual multi-dimensional features of circular nature **"in the wild"**.
1) If I understand correctly, the proposed algorithm can extract interpretable features; however, I imagine a good number of them are not easily, if at all, interpretable. Adding some of your insight into how many potentially interesting multi-dimensional features are among those extracted by the algorithm could improve the article.
2) If I am not mistaken, there is no ablation on how much the threshold parameter T affects the extracted clusters: Having an understanding of w
The extension of the linear representation hypothesis from a one-dimensional to a multi-dimensional one provides good insights for researchers in the mechanistic interpretability field. Concrete and extensive empirical results show that multi-dimensional features exist in LLMs. In particular, the circular representations of days of the week are interesting.
Some details in formalization and experiments are not clear. Please see the questions below.
Taxonomy
Topics: Natural Language Processing Techniques
Methods: Attention Is All You Need · Layer Normalization · Cosine Annealing · Discriminative Fine-Tuning · Attention Dropout · Linear Layer · Multi-Head Attention · Residual Connection · Weight Decay
