From Directions to Regions: Decomposing Activations in Language Models via Local Geometry
Or Shafran, Shaked Ronen, Omri Fahn, Shauli Ravfogel, Atticus Geiger, Mor Geva

TL;DR
This paper introduces a scalable, unsupervised method that uses a Mixture of Factor Analyzers (MFA) to decompose language model activations into local geometric regions, capturing complex nonlinear concept structures and improving interpretability.
Contribution
The paper proposes an MFA-based approach to activation decomposition that models local geometry rather than individual global directions. It outperforms unsupervised baselines on localization and achieves steering performance that rivals supervised methods.
Findings
MFA captures complex, nonlinear activation structures.
MFA outperforms unsupervised baselines in localization tasks.
MFA achieves strong steering performance, rivaling supervised methods.
Abstract
Activation decomposition methods in language models are tightly coupled to geometric assumptions on how concepts are realized in activation space. Existing approaches search for individual global directions, implicitly assuming linear separability, which overlooks concepts with nonlinear or multi-dimensional structure. In this work, we leverage Mixture of Factor Analyzers (MFA) as a scalable, unsupervised alternative that models the activation space as a collection of Gaussian regions with their local covariance structure. MFA decomposes activations into two compositional geometric objects: the region's centroid in activation space, and the local variation from the centroid. We train large-scale MFAs for Llama-3.1-8B and Gemma-2-2B, and show they capture complex, nonlinear structures in activation space. Moreover, evaluations on localization and steering benchmarks show that MFA…
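The decomposition described in the abstract (a region centroid plus a local variation around it) can be illustrated with a minimal sketch of the textbook Mixture of Factor Analyzers model, in which component k has density N(mu_k, W_k W_k^T + Psi_k). This is not the authors' implementation: the function name `mfa_decompose`, the parameter names `pis`, `mus`, `Ws`, `psis`, and the toy values are all assumptions made for illustration, and the paper's training procedure and exact parametrization are not given in this excerpt.

```python
import numpy as np

def mfa_decompose(x, pis, mus, Ws, psis):
    """Split x into (centroid, local variation) under a textbook MFA.

    Hypothetical helper: component k is N(mus[k], Ws[k] @ Ws[k].T + diag(psis[k])).
    """
    K = len(pis)
    log_post = np.empty(K)
    for k in range(K):
        cov = Ws[k] @ Ws[k].T + np.diag(psis[k])
        diff = x - mus[k]
        _, logdet = np.linalg.slogdet(cov)
        maha = diff @ np.linalg.solve(cov, diff)
        # The -d/2 * log(2*pi) constant is shared by all components, so it is dropped.
        log_post[k] = np.log(pis[k]) - 0.5 * (logdet + maha)

    # Normalize in log space to get responsibilities p(k | x).
    log_post -= log_post.max()
    resp = np.exp(log_post)
    resp /= resp.sum()

    # Pick the most responsible region and compute the posterior mean of its factors.
    k = int(np.argmax(resp))
    cov = Ws[k] @ Ws[k].T + np.diag(psis[k])
    z = Ws[k].T @ np.linalg.solve(cov, x - mus[k])
    local = Ws[k] @ z  # variation from the centroid within the local subspace
    return k, mus[k], local, resp

# Toy parameters (assumed values): two regions in a 4-d space, two factors each.
pis = np.array([0.5, 0.5])
mus = np.array([[0.0, 0.0, 0.0, 0.0], [5.0, 5.0, 5.0, 5.0]])
Ws = np.array([
    [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]],
    [[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]],
])
psis = np.full((2, 4), 0.1)

# A point generated from region 1's local subspace.
x = mus[1] + Ws[1] @ np.array([1.0, -0.5])
k, centroid, local, resp = mfa_decompose(x, pis, mus, Ws, psis)
print(k, centroid, local)
```

In this sketch the reconstruction `centroid + local` does not equal `x` exactly, because the posterior mean of the factors is shrunk toward zero by the noise term `psis`; the remainder is treated as unexplained noise.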
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Graph Neural Networks
