Not All Language Model Features Are One-Dimensionally Linear

Joshua Engels; Eric J. Michaud; Isaac Liao; Wes Gurnee; Max Tegmark

arXiv:2405.14860 · cs.LG · February 28, 2025 · 3 cites

TL;DR

This paper investigates whether some language model representations are inherently multi-dimensional, develops a method to identify such features, and demonstrates their interpretability and computational relevance in GPT-2 and Mistral 7B.

Contribution

It introduces a scalable autoencoder-based approach to discover multi-dimensional features in language models, revealing interpretable circular features and their role in computation.
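As a rough illustration of how dictionary directions from a sparse autoencoder might be grouped into candidate multi-dimensional features, the sketch below links vectors whose cosine similarity exceeds a threshold and returns the connected components. It is a minimal sketch under assumptions, not the paper's implementation: the function names, the toy vectors, and the use of connected components are all hypothetical choices made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cluster_by_similarity(vectors, threshold):
    """Group vector indices into connected components of the graph whose
    edges join pairs with cosine similarity >= threshold (assumed rule)."""
    parent = list(range(len(vectors)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(vectors)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Toy "dictionary": two pairs of nearly parallel directions.
toy_vectors = [(1.0, 0.0, 0.0), (0.9, 0.1, 0.0), (0.0, 0.0, 1.0), (0.0, 0.1, 0.9)]
clusters = cluster_by_similarity(toy_vectors, threshold=0.8)
```

Raising the threshold splits the groups apart and lowering it merges them, so the choice of threshold directly shapes which clusters are candidates for inspection.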

Findings

01

Identified interpretable circular features representing days and months.

02

Demonstrated these features are used in modular arithmetic tasks.

03

Provided evidence that multi-dimensional features are fundamental to certain model behaviors.

Abstract

Recent work has proposed that language models perform computation by manipulating one-dimensional representations of concepts ("features") in activation space. In contrast, we explore whether some language model representations may be inherently multi-dimensional. We begin by developing a rigorous definition of irreducible multi-dimensional features based on whether they can be decomposed into either independent or non-co-occurring lower-dimensional features. Motivated by these definitions, we design a scalable method that uses sparse autoencoders to automatically find multi-dimensional features in GPT-2 and Mistral 7B. These auto-discovered features include strikingly interpretable examples, e.g. circular features representing days of the week and months of the year. We identify tasks where these exact circles are used to solve computational problems involving modular arithmetic in…
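To make the idea of a circular feature concrete, the sketch below places the seven days of the week on a unit circle and performs "day plus offset" arithmetic by rotating a point and reading off the nearest day. It is a toy illustration under assumptions: the 2-D embedding, the function names, and the decoding rule are hypothetical and are not taken from the models studied in the paper.

```python
import math

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]


def embed(index, period=7):
    """Map an item index to a point on the unit circle (assumed 2-D feature)."""
    angle = 2 * math.pi * index / period
    return (math.cos(angle), math.sin(angle))


def rotate(point, steps, period=7):
    """Rotate a point by `steps` positions around the circle."""
    angle = 2 * math.pi * steps / period
    x, y = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y)


def decode(point, period=7):
    """Recover the nearest item index from a point on the circle."""
    angle = math.atan2(point[1], point[0])
    return round(angle * period / (2 * math.pi)) % period


def add_days(day, offset):
    """Modular addition on days, carried out as a rotation on the circle."""
    start = DAYS.index(day)
    return DAYS[decode(rotate(embed(start), offset))]
```

For example, `add_days("Saturday", 3)` rotates Saturday's point three steps and decodes to `"Tuesday"`. The wrap-around comes from the geometry of the circle itself, which a single one-dimensional direction cannot represent.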

Peer Reviews

Decision · ICLR 2025 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

- The paper tackles a timely and important question in mechanistic interpretability by formalizing (inherently) multi-dimensional features in LLMs.
- The visual examples of circular representations are clear and compelling.
- I think the high-level idea of formalizing multi-dimensional, irreducible features is sensible and useful.
- Overall, the experiments are comprehensive with a variety of interesting results.

Weaknesses

- I think there are some conceptual questions about the definitions of features and irreducibility.
- Conceptually, I think the definition of irreducibility is somewhat incomplete. First, there can be concepts that are correlated but can still be disentangled for separate interventions, so equating separability with independence can be limiting. Second, I can't tell for sure if the definition of reducibility is exhaustive. (Can there be a third category?)
- I have asked about some of

Reviewer 02 · Rating 8 · Confidence 3

Strengths

1) The updated proposed definition for the superposition hypothesis is sound and may lead to the discovery of other interesting structures such as the ones presented in the paper. Even if this direction does not lead to further scientific discoveries, the discovered representations themselves are an interesting finding.
2) In the paper, features are extracted from state-of-the-art LLMs, showcasing the existence of actual multi-dimensional features of circular nature "in the wild".

Weaknesses

1) If I understand correctly, the proposed algorithm can extract interpretable features; however, I imagine a good number of them are not easily interpretable, if at all. Adding some of your insight on how many potentially interesting multi-dimensional features are among the ones extracted by the algorithm could improve the article.
2) If I am not mistaken, there is no ablation on how much the threshold parameter T affects the extracted clusters: Having an understanding of w

Reviewer 03 · Rating 6 · Confidence 4

Strengths

The extension of the linear representation hypothesis from one-dimensional to multi-dimensional features provides good insights for researchers in the mechanistic interpretability field. Concrete and extensive empirical results show that multi-dimensional features exist in LLMs. In particular, the circular representations of days of the week are interesting.

Weaknesses

Some details in formalization and experiments are not clear. Please see the questions below.

Code & Models

Repositories

joshengels/multidimensionalfeatures
PyTorch · Official

Videos

Not All Language Model Features Are One-Dimensionally Linear · SlidesLive

Taxonomy

Topics · Natural Language Processing Techniques

Methods · Attention Is All You Need · Layer Normalization · Cosine Annealing · Discriminative Fine-Tuning · Attention Dropout · Linear Layer · Multi-Head Attention · Residual Connection · Weight Decay