From Isolation to Entanglement: When Do Interpretability Methods Identify and Disentangle Known Concepts?
Aaron Mueller, Andrew Lee, Shruti Joshi, Ekdeep Singh Lubana, Dhanya Sridhar, Patrik Reizinger

TL;DR
This paper investigates when interpretability methods such as sparse autoencoders (SAEs) and sparse probes can reliably identify and disentangle known concepts in neural networks, especially when those concepts are correlated, and reveals limitations in current evaluation metrics.
Contribution
It introduces a multi-concept evaluation framework that controls the correlations between concepts, and shows that current disentanglement metrics and methods have significant limitations with respect to independence and selectivity.
Findings
Features correspond to at most one concept but concepts spread across many features.
Features affect many concepts when steered, indicating lack of independence.
Disjoint subspaces do not guarantee concept selectivity.
Abstract
A central goal of interpretability is to recover representations of causally relevant concepts from the activations of neural networks. The quality of these concept representations is typically evaluated in isolation, and under implicit independence assumptions that may not hold in practice. Thus, it is unclear whether common featurization methods - including sparse autoencoders (SAEs) and sparse probes - recover disentangled representations of these concepts. This study proposes a multi-concept evaluation setting where we control the correlations between textual concepts, such as sentiment, domain, and tense, and analyze performance under increasing correlations between them. We first evaluate the extent to which featurizers can learn disentangled representations of each concept under increasing correlational strengths. We observe a one-to-many relationship from concepts to features:…
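The controlled-correlation setting described in the abstract can be illustrated with a minimal sketch. It assumes two binary concepts (for example, positive vs. negative sentiment and past vs. present tense), each balanced at 0.5, and a target Pearson correlation `rho`. The function name `sample_correlated_concepts` and the sampling scheme are illustrative assumptions, not the authors' data-generation procedure.

```python
import numpy as np

def sample_correlated_concepts(n, rho, seed=0):
    """Sample two balanced binary concept labels with Pearson correlation ~rho.

    Hypothetical helper for illustration; assumes 0 <= rho <= 1.
    For two Bernoulli(0.5) variables, corr = 2 * P(a == b) - 1,
    so the labels must agree with probability (1 + rho) / 2.
    """
    rng = np.random.default_rng(seed)
    a = rng.integers(0, 2, size=n)             # first concept, e.g. sentiment
    agree = rng.random(n) < (1.0 + rho) / 2.0  # where the second concept copies the first
    b = np.where(agree, a, 1 - a)              # second concept, e.g. tense
    return a, b

# Empirical correlation should track the requested value.
for rho in (0.0, 0.5, 0.9):
    a, b = sample_correlated_concepts(100_000, rho)
    print(rho, round(float(np.corrcoef(a, b)[0, 1]), 3))
```

At `rho = 0` the two labels are independent; as `rho` approaches 1 they become nearly indistinguishable from labels alone, which is the regime where disentangling them is hardest.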
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
Overall, the paper shows that SAEs struggle compared to supervised methods along several interesting axes, and may need multiple latents to express a single concept. Other strengths include:
- A controlled setup that is nontrivial enough to be interesting while still retaining useful ground truth.
- Lots of relevant experiments.
- Writing that is clear for the most part, though the experiments involve a lot of details.
- There is only a synthetic dataset evaluation, and the work feels incremental overall compared to prior work.
- There are many different procedures for assigning SAE latent(s) to a concept (at least three by my count), which could make the paper confusing for readers to navigate.
- In experiment 4.2, my understanding is that both the steering manipulation (equation 2, lines 314-315) and the method to evaluate log-odds (using the output of a linear probe) are linear, so it seems tautological that any
The idea of measuring the disentanglement, and thus the interpretability, of SAE-like methods is timely and important for understanding what we can expect from these methods in practice. The insights about steering and disjointness are useful and well explained in the paper. The paper is well presented, and the research analyses are well formulated and investigated with rigor. Overall, the paper shows in a clear way that SAE-like methods do not come with guarantees for properly disentangling ground-truth conce
This paper misses comparisons to other works that recently appeared in the SAE literature and treat closely related aspects (among them identifiability), e.g. [1,2,3]. For this reason, I cannot say the paper excels in novelty. Also, while MCC is quite a popular metric for studying disentanglement of representations, it has some pitfalls (since it only tests correlations) that other disentanglement metrics cover; see e.g. [4,5]. For example, DCI-ES [5] includes a training phase on a probe to detect whi
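For readers unfamiliar with the MCC metric mentioned in this review, the following is a minimal sketch of one common formulation: compute the absolute Pearson correlation between every ground-truth concept and every learned feature, match concepts to features one-to-one so that the total correlation is maximal, and average the matched values. The function name and the exhaustive matching are illustrative assumptions; the paper's exact variant is not specified in the text above, and larger problems would typically use an assignment solver instead of enumeration.

```python
import itertools
import numpy as np

def mean_correlation_coefficient(concepts, features):
    """MCC sketch. concepts: (n, k) array; features: (n, m) array, m >= k.

    Hypothetical helper for illustration only.
    """
    k, m = concepts.shape[1], features.shape[1]
    # Correlation matrix over all columns; keep the concept-vs-feature block.
    full = np.corrcoef(np.hstack([concepts, features]), rowvar=False)
    corr = np.abs(full[:k, k:])
    # Exhaustive one-to-one matching; only practical for a handful of dimensions.
    best = max(
        sum(corr[i, j] for i, j in enumerate(perm))
        for perm in itertools.permutations(range(m), k)
    )
    return best / k

rng = np.random.default_rng(0)
z = rng.normal(size=(5000, 2))                        # two ground-truth concepts
recovered = z[:, ::-1] * np.array([-2.0, 3.0])        # permuted, rescaled, sign-flipped copies
unrelated = rng.normal(size=(5000, 2))                # features independent of the concepts
print(mean_correlation_coefficient(z, recovered))     # close to 1
print(mean_correlation_coefficient(z, unrelated))     # close to 0
```

Because the score depends only on pairwise correlations, it is invariant to permutation, scaling, and sign, and it cannot distinguish a feature that encodes a concept from one that merely co-varies with it. That is consistent with the pitfall the reviewer raises.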
- Evaluating the ability of interpretability methods to disentangle correlated concepts is a very relevant question for the field. The approach of using synthetic datasets with controlled correlations is elegant and allows interpretability methods to be evaluated on actual models with still natural-ish text.
- Diverse evaluation of learned representations (feature alignment, sparse probing, steering) to measure independence and disjointness.
- Diverse set of interpretability methods and baseline
- Several load-bearing details are omitted from the manuscript. It's unclear how sampled concepts are used to construct the actual natural language sentences in the dataset. For SAEs, critical training details are missing, particularly sparsity levels and reconstruction loss. Without these metrics, it's impossible to determine whether SAE performance differences reflect poor training or actual effects of data correlations.
- The synthetic data setup is compelling, but the experiments are limited
The paper addresses an important lacuna in the field of interpretability research. Prior research has focused on evaluating features in isolation, but has not considered their interactions with other features. The paper has excellent presentation, clearly defining how it seeks to measure disentanglement, setting lower and upper bounds with the "baselines and skylines" section, and helping the reader understand the shape of its figures in the text and the "groundtruth" in Figure 4.
The paper is modest in scope. It seeks to evaluate existing techniques via new metrics, but without turning the evaluations into a benchmark that other researchers could easily use and build upon. Despite otherwise clear communication, the paper omits key information needed to reproduce its results. This includes:
- Which layer(s) of the language model were used to train the SAEs.
- Hyperparameters of SAE training, such as number of features and amount of sparsity.
- The structure of
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Topic Modeling · Sentiment Analysis and Opinion Mining
