Identifying Intervenable and Interpretable Features via Orthogonality Regularization

Moritz Miller; Florent Draye; Bernhard Schölkopf

arXiv:2602.04718·cs.LG·February 5, 2026

Open Access

TL;DR

This paper introduces an orthogonality regularization technique that disentangles the decoder features of a sparse autoencoder in a fine-tuned language model, improving interpretability and enabling isolated causal interventions while keeping performance on the target dataset essentially unchanged.

Contribution

The paper proposes an orthogonality penalty that yields identifiable, modular features suitable for causal intervention in language models.

Findings

01. Orthogonality regularization improves feature interpretability.

02. Disentangled features enable isolated causal interventions.

03. Performance on the target dataset remains essentially unchanged under the orthogonality penalty.

Abstract

Building on recent progress in fine-tuning language models around a fixed sparse autoencoder, we disentangle the decoder matrix into almost orthogonal features. This reduces interference and superposition between the features while keeping performance on the target dataset essentially unchanged. Our orthogonality penalty leads to identifiable features, ensuring the uniqueness of the decomposition. Further, we find that the distance between embedded feature explanations increases with a stricter orthogonality penalty, a desirable property for interpretability. Invoking the Independent Causal Mechanisms principle, we argue that orthogonality promotes modular representations amenable to causal intervention. We empirically show that these increasingly orthogonalized features allow for isolated interventions. Our code is available at https://github.com/mrtzmllr/sae-icm.
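
The abstract does not state the exact form of the orthogonality penalty. As a rough illustration only, the sketch below assumes one common choice: the mean squared cosine similarity between distinct decoder feature directions, which is zero exactly when the features are mutually orthogonal. The function name `orthogonality_penalty`, the row-per-feature layout, and the toy matrices are assumptions made for this example and are not taken from the paper or its repository. It is written in dependency-free Python for readability; a training setup would more plausibly use a tensor library.

```python
import math


def orthogonality_penalty(decoder_rows):
    """Mean squared cosine similarity between distinct feature directions.

    decoder_rows: list of non-zero feature vectors, one per feature
    (hypothetical layout, assumed for this sketch).
    Returns 0.0 when all features are mutually orthogonal.
    """
    # Normalize each feature direction to unit length so the penalty
    # depends only on the angles between features, not on their scale.
    unit = []
    for row in decoder_rows:
        norm = math.sqrt(sum(x * x for x in row))
        unit.append([x / norm for x in row])

    n = len(unit)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            # Cosine similarity between features i and j.
            cos = sum(a * b for a, b in zip(unit[i], unit[j]))
            total += cos * cos

    num_pairs = n * (n - 1) / 2
    return total / num_pairs if num_pairs else 0.0


# Toy decoders (illustrative values, not results from the paper).
orthogonal = [[1.0, 0.0], [0.0, 1.0]]   # features at 90 degrees
entangled = [[1.0, 0.0], [1.0, 1.0]]    # features at 45 degrees

penalty_orthogonal = orthogonality_penalty(orthogonal)
penalty_entangled = orthogonality_penalty(entangled)
```

In a training loop, a term of this kind would typically be scaled by a coefficient and added to the task loss, so that a larger coefficient corresponds to the "stricter orthogonality penalty" the abstract refers to. How the authors weight or schedule the penalty is not described here.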

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Generative Adversarial Networks and Image Synthesis · Machine Learning in Healthcare