Automated Interpretability Metrics Do Not Distinguish Trained and Random Transformers

Thomas Heap; Tim Lawson; Lucy Farnik; Laurence Aitchison

arXiv:2501.17727·cs.LG·January 28, 2026

Open Access 3 Reviews

TL;DR

Common SAE-based interpretability metrics often cannot distinguish trained transformers from randomly initialized ones, which calls into question whether they measure genuinely learned features.

Contribution

The paper shows that standard SAE-based interpretability metrics often take similar values for trained and randomly initialized transformers, so they are insufficient on their own and need to be complemented by better evaluation methods.

Findings

01

SAEs trained on randomly initialized models yield interpretability scores similar to those obtained from trained models.

02

High interpretability scores do not necessarily indicate learned, meaningful features.

03

Routine use of randomized baselines is recommended for more reliable interpretability assessment.

Abstract

Sparse autoencoders (SAEs) are widely used to extract sparse, interpretable latents from transformer activations. We test whether commonly used SAE quality metrics and automatic explanation pipelines can distinguish trained transformers from randomly initialized ones (e.g., where parameters are sampled i.i.d. from a Gaussian). Over a wide range of Pythia model sizes and multiple randomization schemes, we find that, in many settings, SAEs trained on randomly initialized transformers produce auto-interpretability scores and reconstruction metrics that are similar to those from trained models. These results show that high aggregate auto-interpretability scores do not, by themselves, guarantee that learned, computationally relevant features have been recovered. We therefore recommend treating common SAE metrics as useful but insufficient proxies for mechanistic interpretability and argue…
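As a rough illustration of the kind of randomized baseline the abstract describes, the sketch below resamples each parameter tensor i.i.d. from a Gaussian. It is not the authors' code: the `params` layout (a dict of flat lists), the `keep` argument, and the choice to match each tensor's mean and standard deviation are all assumptions made for this example. The `keep` option mirrors the variant mentioned in the reviews, where everything except the embeddings is randomized.

```python
import random
import statistics


def gaussian_reinit(params, keep=(), seed=0):
    """Return a copy of `params` with each tensor resampled i.i.d. from a Gaussian.

    params: dict mapping a parameter name to a flat list of floats
            (hypothetical layout chosen for this sketch).
    keep:   names of tensors to copy unchanged, e.g. the embeddings.
    Matching each tensor's own mean and standard deviation is an assumption
    of this sketch, not something stated in the source.
    """
    rng = random.Random(seed)
    out = {}
    for name, values in params.items():
        if name in keep:
            # Leave this tensor as trained.
            out[name] = list(values)
            continue
        mu = statistics.fmean(values)
        sigma = statistics.pstdev(values)
        # Draw one independent Gaussian sample per original entry.
        out[name] = [rng.gauss(mu, sigma) for _ in values]
    return out


# Toy usage: randomize everything except the (hypothetical) embedding tensor.
toy_params = {"embed": [0.1, -0.2, 0.3], "mlp": [1.0, 2.0, 3.0, 4.0]}
randomized = gaussian_reinit(toy_params, keep=("embed",), seed=1)
```

An SAE and auto-interpretability pipeline would then be run on both the original and the randomized model, and the resulting scores compared side by side.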

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 3

Strengths

Originality: The paper provides the most in-depth study I am aware of on the relationship between auto-interpretability scoring methods and whether or how networks were randomly initialised. Using networks that are randomly initialised except for the embedding space was a good choice and makes the results more interesting, since much of the interpretability of SAE features may come from association with and between tokens.

Quality: The paper is comprehensive - comparing trained models

Weaknesses

- Significance: While feature dashboards and auto-interpretability explanations are commonly used with SAEs, to my knowledge auto-interpretability scoring methods aren't generally considered to be very useful or meaningful (moreover, many SAE quality metrics are not particularly useful in practice). So while the result that these metrics may not be capturing something meaningful seems well supported - the significance is not particularly clear.
- Soundness: The paper discusses auto-interpretabi

Reviewer 02 · Rating 8 · Confidence 4

Strengths

- the title: the title perfectly describes the paper's main finding
- the methods are elegant and sound
- the randomization, control, and training configuration (dataset, n tokens, buffer size, k, expansion, etc.) are well chosen.
- entropy is a great metric to quantify the hypothesis that random SAEs' features activate for token identities
- the authors go into great depth and effort to present a hypothesis, experiments, and evidence that shows why SAEs trained on random weights may still exert,

Weaknesses

The primary reason why I think this paper is great but not exceptional is a possible limitation in its usefulness (that I'm happy to discuss during the rebuttal period). There could be two reasons for the paper's main results: (1) SAEs or auto-interp methods are sus and we should not trust them, and (2) SAEs trained on random weights learn trivial features that are easy to guess and cause the high auto-interp scores. For example, SAEs in layer 0 (or even later) might learn 1 feature for every to

Reviewer 03 · Rating 6 · Confidence 3

Strengths

- S1: Since SAEs are very widely used tools for interpreting the internals of transformer models, the subject of investigation is of interest to the mechanistic interpretability community as a whole.
- S2: The paper is largely well-written, with clear explanations and motivations for each experimental design choice. The Related Work section is extensive.
- S3: A wide range of model sizes are tested (using the Pythia suite); the results and conclusions are therefore more likely to be generalizab

Weaknesses

Major:
- W1: Based on Figure 2, token distribution entropy seems to only separate trained versus randomized transformers in models >1.0B parameters. The authors don’t appear to comment on this, but greater discussion of the specific results of this token distribution entropy metric would be appreciated.
- W2: Somewhat lost in this paper is the original purpose of automated interpretability metrics for SAEs: to find out if the SAE is properly trained, and to monitor its training. It would be a us
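Reviewers 02 and 03 both refer to a token distribution entropy metric. The excerpts here do not give its definition, so the sketch below rests on an assumption: it computes the Shannon entropy (in bits) of the empirical distribution of tokens on which a single SAE latent activates. Under that reading, a latent that fires on only a few token identities has low entropy. The function name `token_entropy` and the input format are hypothetical.

```python
import math
from collections import Counter


def token_entropy(tokens):
    """Shannon entropy in bits of the empirical distribution over `tokens`.

    tokens: the tokens on which one latent was active
            (assumed input format for this sketch).
    Returns 0.0 for an empty list.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in counts.values():
        p = count / total
        entropy -= p * math.log2(p)
    return entropy


# A latent that fires on a single token identity versus one spread over four.
single_token_latent = token_entropy(["the", "the", "the", "the"])
spread_latent = token_entropy(["cat", "ran", "blue", "of"])
```

In this toy case the first latent scores lower than the second, which is the direction one would expect if SAEs trained on randomized models mostly pick up token identity.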

Taxonomy

Topics: Advanced Neural Network Applications · Anomaly Detection Techniques and Applications · Computational Physics and Python Applications