Pairwise matrices for sparse autoencoders: single-feature inspection mislabels causal axes
Michael A. Riegler, Birk Sebastian Frostelid Torpmann-Hagen

TL;DR
This paper introduces a pairwise matrix protocol for sparse-autoencoder (SAE) interpretability that co-varies steering coefficient with joint condition. On Qwen3-1.7B-Instruct, with replication on Gemma-2-2B-it, it reports feature interactions and causal axes that the standard single-feature protocol misses.
Contribution
It proposes a pairwise matrix approach to SAE interpretability that uncovers feature interactions and causal axes not detected by the standard single-feature protocol.
Findings
A feature labelled "AI self-disclaimer" from its top-activating contexts produces an inverted U-shape under a coefficient sweep.
Joint feature suppression affects grounded composition more than single-feature suppression.
Matched-geometry perturbations reveal distinct output regimes.
Abstract
The standard sparse-autoencoder (SAE) interpretability protocol labels each feature from its top-activating contexts and validates by single-feature steering. We propose the pairwise matrix protocol, co-varying steering coefficient with joint condition, and report three findings the standard one-corner protocol misses on Qwen3-1.7B-Instruct, replicated on Gemma-2-2B-it. First, a feature labelled "AI self-disclaimer" from its top contexts produces an inverted U-shape under a coefficient sweep: at c=+500 the model substitutes a fluent contemplative-philosopher voice for the disclaimer. Two further features anchor the criterion (one monotonic, one pure breakdown). Second, three near-orthogonal cluster-specific features that individually steer a philosophy-of-mind register, jointly suppressed at c=-500, damage grounded composition on recipes and engine explanations as well as introspective…
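The idea of co-varying steering coefficient with joint condition can be pictured as filling a grid with one row per set of steered features and one column per coefficient. The following is a minimal sketch under stated assumptions, not the authors' implementation: the additive steering rule, the random unit directions standing in for SAE decoder vectors, the toy `score_fn`, and all function names are hypothetical and chosen only for illustration.

```python
import numpy as np

def steer(hidden, directions, coeff):
    """Add coeff times each selected direction to a hidden state (assumed steering rule)."""
    return hidden + coeff * directions.sum(axis=0)

def pairwise_matrix(hidden, decoder, conditions, coeffs, score_fn):
    """Score every (condition, coefficient) cell of the grid."""
    matrix = np.zeros((len(conditions), len(coeffs)))
    for i, feature_ids in enumerate(conditions):
        dirs = decoder[list(feature_ids)]  # rows of the (hypothetical) decoder
        for j, c in enumerate(coeffs):
            matrix[i, j] = score_fn(steer(hidden, dirs, c))
    return matrix

# Toy stand-ins: three random unit directions and a random hidden state.
rng = np.random.default_rng(0)
decoder = rng.normal(size=(3, 16))
decoder /= np.linalg.norm(decoder, axis=1, keepdims=True)
hidden = rng.normal(size=16)
probe = rng.normal(size=16)

def score_fn(h):
    """Hypothetical scalar readout standing in for an output-quality metric."""
    return float(np.tanh(h @ probe / 100.0))

# Rows: each feature alone, then all three jointly. Columns: coefficient sweep.
conditions = [(0,), (1,), (2,), (0, 1, 2)]
coeffs = [-500.0, -100.0, 0.0, 100.0, 500.0]
grid = pairwise_matrix(hidden, decoder, conditions, coeffs, score_fn)
print(grid.shape)
```

Reading the grid along a row corresponds to a single-condition coefficient sweep, while comparing the last row against the first three contrasts joint steering with single-feature steering. In a real experiment the hidden state would come from a model's residual stream and the score from generated text, neither of which this sketch attempts.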
