Identifying Intervenable and Interpretable Features via Orthogonality Regularization
Moritz Miller, Florent Draye, Bernhard Schölkopf

TL;DR
This paper introduces an orthogonality regularization technique to disentangle features in language models, enhancing interpretability and causal intervention capabilities without sacrificing performance.
Contribution
It proposes a novel orthogonality penalty that yields identifiable, modular features suitable for causal intervention in language models.
Findings
Orthogonality regularization improves feature interpretability.
Disentangled features enable isolated causal interventions.
Performance on the target dataset remains essentially unchanged under the orthogonality penalty.
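To make the "isolated intervention" finding concrete, here is a minimal sketch of why near-orthogonal feature directions limit side effects. It assumes a simple linear picture that is not taken from the paper: each feature is a column of a decoder matrix, an intervention adds a multiple of one column to the activation, and every other feature is read out by a dot product with its own unit-norm direction. The function name `intervention_leakage` and its parameters are hypothetical.

```python
import numpy as np


def intervention_leakage(decoder: np.ndarray, j: int, delta: float = 1.0) -> np.ndarray:
    """Change seen by every other feature's linear readout when feature j is shifted.

    decoder: array of shape (d_model, n_features); each column is one feature direction.
    Assumption (illustrative only): features are read out by dot product with their
    unit-norm decoder column.
    """
    # Normalize each column so readouts are cosine-scaled.
    unit = decoder / np.linalg.norm(decoder, axis=0, keepdims=True)
    # Intervening on feature j shifts the activation along its direction.
    shift = delta * unit[:, j]
    # Each feature's readout changes by its overlap with that shift.
    leakage = unit.T @ shift
    # The effect on feature j itself is intended, so exclude it.
    leakage[j] = 0.0
    return leakage


# Orthogonal directions: no other readout moves.
print(intervention_leakage(np.eye(3), j=0))

# Overlapping directions: the second feature's readout moves as well.
print(intervention_leakage(np.array([[1.0, 1.0], [0.0, 1.0]]), j=0))
```

Under these assumptions the leakage onto feature i equals delta times the cosine similarity between directions i and j, so it vanishes exactly when the two directions are orthogonal.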
Abstract
Building on recent progress in fine-tuning language models around a fixed sparse autoencoder, we disentangle the decoder matrix into almost orthogonal features. This reduces interference and superposition between the features while keeping performance on the target dataset essentially unchanged. Our orthogonality penalty leads to identifiable features, ensuring the uniqueness of the decomposition. Further, we find that the distance between embedded feature explanations increases with a stricter orthogonality penalty, a desirable property for interpretability. Invoking the principle, we argue that orthogonality promotes modular representations amenable to causal intervention. We empirically show that these increasingly orthogonalized features allow for isolated interventions. Our code is available under .
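The abstract does not spell out the form of the orthogonality penalty. As an illustration only, the sketch below uses one common choice: the mean squared off-diagonal cosine similarity between decoder columns. The function name `orthogonality_penalty`, the column-wise layout of the decoder, and the normalization are assumptions, not the authors' definition.

```python
import numpy as np


def orthogonality_penalty(decoder: np.ndarray) -> float:
    """Mean squared cosine similarity between distinct feature directions.

    decoder: array of shape (d_model, n_features); each column is one feature.
    Returns 0.0 when all columns are mutually orthogonal and 1.0 when all
    columns are parallel. This is an illustrative penalty, not the paper's.
    """
    # Normalize columns so the Gram matrix holds cosine similarities.
    norms = np.linalg.norm(decoder, axis=0, keepdims=True)
    unit = decoder / np.clip(norms, 1e-12, None)
    gram = unit.T @ unit
    # Keep only the interactions between different features.
    off_diag = gram - np.diag(np.diag(gram))
    n = decoder.shape[1]
    return float(np.sum(off_diag ** 2) / (n * (n - 1)))


# Example: a random overcomplete decoder (more features than dimensions).
rng = np.random.default_rng(0)
decoder = rng.normal(size=(16, 64))
print(orthogonality_penalty(decoder))
```

In a training loop, a term like this would typically be added to the task loss with a weight that controls how strict the penalty is. When there are more features than dimensions, the columns cannot all be exactly orthogonal, which is consistent with the abstract's wording of "almost orthogonal" features.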
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Generative Adversarial Networks and Image Synthesis · Machine Learning in Healthcare
