Pairwise matrices for sparse autoencoders: single-feature inspection mislabels causal axes

Michael A. Riegler; Birk Sebastian Frostelid Torpmann-Hagen

arXiv:2605.03160·cs.LG·May 6, 2026

TL;DR

This paper introduces a pairwise matrix protocol for sparse-autoencoder (SAE) interpretability that co-varies steering coefficient with joint condition. On Qwen3-1.7B-Instruct, with replication on Gemma-2-2B-it, it reports three findings that the standard single-feature protocol misses.

Contribution

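It proposes a pairwise matrix protocol that co-varies steering coefficient with joint condition, exposing feature interactions and mislabelled causal axes that the standard protocol (labelling each feature from its top-activating contexts and validating by single-feature steering) does not detect.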

Findings

01

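A feature labelled "AI self-disclaimer" from its top contexts produces an inverted U-shape under a coefficient sweep: at c=+500 the model substitutes a fluent contemplative-philosopher voice for the disclaimer.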

02

Joint suppression of three near-orthogonal features at c=-500 damages grounded composition (recipes, engine explanations) more than single-feature suppression does.

03

Matched-geometry perturbations reveal distinct output regimes.

Abstract

The standard sparse-autoencoder (SAE) interpretability protocol labels each feature from its top-activating contexts and validates by single-feature steering. We propose the pairwise matrix protocol, co-varying steering coefficient with joint condition, and report three findings the standard one-corner protocol misses on Qwen3-1.7B-Instruct, replicated on Gemma-2-2B-it. First, a feature labelled "AI self-disclaimer" from its top contexts produces an inverted U-shape under a coefficient sweep: at c=+500 the model substitutes a fluent contemplative-philosopher voice for the disclaimer. Two further features anchor the criterion (one monotonic, one pure breakdown). Second, three near-orthogonal cluster-specific features that individually steer a philosophy-of-mind register, jointly suppressed at c=-500, damage grounded composition on recipes and engine explanations as well as introspective…
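The idea of co-varying steering coefficient with joint condition can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the authors' implementation: steering is modelled as adding c times a unit-normalised feature direction to an activation vector, the three features are orthogonal stand-ins, and `readout` is a hypothetical linear probe standing in for whatever behavioural measurement the paper applies to model outputs.

```python
import numpy as np

def steer(h, directions, c):
    """Add c times each unit-normalised direction to activation h (assumed steering rule)."""
    out = np.asarray(h, dtype=float).copy()
    for d in directions:
        d = np.asarray(d, dtype=float)
        out = out + c * d / np.linalg.norm(d)
    return out

def pairwise_matrix(h, conditions, coefficients, readout):
    """Rows index conditions (sets of features steered together), columns index coefficients."""
    m = np.empty((len(conditions), len(coefficients)))
    for i, dirs in enumerate(conditions):
        for j, c in enumerate(coefficients):
            m[i, j] = readout(steer(h, dirs, c))
    return m

# Toy setup: hypothetical 16-dimensional activation and three orthogonal feature directions.
rng = np.random.default_rng(0)
dim = 16
h = rng.normal(size=dim)
f1, f2, f3 = np.eye(dim)[:3]

# Three single-feature conditions plus one joint condition.
conditions = [[f1], [f2], [f3], [f1, f2, f3]]
coefficients = [-500.0, 0.0, 500.0]

# Hypothetical linear readout; a real study would score generated text instead.
probe = rng.normal(size=dim)

def readout(x):
    return float(x @ probe)

matrix = pairwise_matrix(h, conditions, coefficients, readout)
print(matrix.shape)
```

With a purely linear readout like this one, the joint row's shift from baseline equals the sum of the single-feature shifts, so the matrix carries no extra information. The protocol becomes informative only when the measured behaviour is nonlinear, which is the situation the abstract describes for joint suppression at c=-500.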
