Geometry-Adaptive Explainer for Faithful Dictionary-Based Interpretability under Distribution Shift
Sungjun Lim, Heedong Kim, Andrew Lee, and Kyungwoo Song

TL;DR
This paper introduces the Geometry-Adaptive Explainer (GAE), a method that realigns dictionary-based interpretability tools with out-of-distribution data, improving faithfulness without retraining or labels.
Contribution
The paper formalizes the faithfulness gap caused by distribution shift and proposes GAE, which adaptively realigns explainers using only unlabeled OOD activations, enhancing interpretability under shift.
Findings
GAE reduces the faithfulness gap caused by distribution shift.
Empirical results show GAE matches or surpasses training-based baselines.
Theoretically, GAE's excess loss is bounded quadratically in the second-moment shift.
Abstract
Mechanistic interpretability aims to explain a model's behavior by identifying causally responsible internal structures. Dictionary-based explainers such as sparse autoencoders and transcoders are a primary tool, but their faithfulness under out-of-distribution (OOD) shift has received little systematic attention. We show that distribution shift rotates the subspace that the model actively uses, misaligning the explainer's dictionary trained on in-distribution (ID) activations. We formalize this misalignment as the faithfulness gap, a geometric distance between the ID dictionary and the OOD-active subspace, and show that it controls OOD faithfulness degradation. To reduce this gap, we propose the Geometry-Adaptive Explainer (GAE), which realigns the explainer's dictionary with the OOD-active subspace while preserving the original feature structure. This requires only unlabeled OOD…
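To make the geometric picture in the abstract concrete, here is a minimal sketch of one way to measure how far a dictionary's span sits from the subspace that a batch of activations actually occupies. It is an illustration under stated assumptions, not the paper's definition of the faithfulness gap: the function names `top_subspace` and `subspace_gap`, the use of the top-k right singular vectors of the (uncentered) activation matrix as the "active subspace", and the normalized projection residual as the distance are all choices made here. It also assumes the dictionary passed in has fewer directions than the activation dimension (for example, only the features that fire on ID data), since an overcomplete dictionary spans the whole space and would give a gap of zero.

```python
import numpy as np


def top_subspace(acts: np.ndarray, k: int) -> np.ndarray:
    """Return a (d, k) orthonormal basis for the top-k directions of `acts`.

    `acts` has shape (n_samples, d). The right singular vectors of the
    uncentered matrix are the eigenvectors of its second-moment matrix.
    """
    _, _, vt = np.linalg.svd(acts, full_matrices=False)
    return vt[:k].T


def subspace_gap(dictionary: np.ndarray, acts: np.ndarray, k: int) -> float:
    """Fraction of the top-k active subspace lying outside the dictionary span.

    `dictionary` has shape (d, m) with one direction per column (m <= d assumed).
    Returns a value in [0, 1]: 0 if the active subspace is contained in the
    dictionary span, 1 if it is orthogonal to it.
    """
    q_dict, _ = np.linalg.qr(dictionary)       # orthonormal basis of the span
    u = top_subspace(acts, k)                  # (d, k) active directions
    resid = u - q_dict @ (q_dict.T @ u)        # part not explained by the span
    return float(np.linalg.norm(resid) ** 2 / k)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    eye = np.eye(6)
    dictionary = eye[:, :2]                            # spans axes 0 and 1
    acts_id = rng.normal(size=(200, 2)) @ eye[[0, 1]]  # lives on axes 0 and 1
    acts_ood = rng.normal(size=(200, 2)) @ eye[[0, 2]] # one axis has rotated away
    print(subspace_gap(dictionary, acts_id, k=2))
    print(subspace_gap(dictionary, acts_ood, k=2))
```

In this toy setup the ID activations lie entirely inside the dictionary span, while the shifted activations keep one shared axis and move the other outside it, so the measured gap grows. Because it uses only activations, a quantity of this kind can be computed from unlabeled OOD data.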
