Reading Task Failure Off the Activations: A Sparse-Feature Audit of GPT-2 Small on Indirect Object Identification

Mahdi Nasermoghadasi

arXiv:2605.22719 · cs.LG · May 22, 2026

TL;DR

This paper presents a reproducible audit of GPT-2 small's sparse-autoencoder (SAE) features on the Indirect Object Identification (IOI) task. It identifies features that correlate with task failure, but shows that ablating the strongest of them does not improve accuracy.

Contribution

The paper introduces a model-agnostic, cost-effective audit pipeline that surfaces interpretable features correlated with task failure without claiming that they cause it.

Findings

01

146 of the 24,576 layer-8 SAE features differ significantly (Holm-corrected) between failed and successful trials.

02

Feature 17,491 (d = +2.93) is the strongest single correlate of failure and is essentially silent unless the transferred object is 'the keys'.

03

A causal ablation of feature 17,491 does not improve accuracy.
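
Finding 01 rests on a per-feature screen that the abstract describes as a Holm-corrected significance threshold plus an effect-size cut (|Cohen's d| > 0.8). A minimal sketch of such a screen follows. Several things in it are assumptions rather than details from the paper: the names `cohens_d`, `holm_reject`, and `screen`, the pooled-variance form of d, the level alpha = 0.05, and the choice to apply the effect-size cut only to features that already passed Holm. The per-feature p-values are taken as given, since the visible text does not say which test produced them, and the inputs are toy numbers.

```python
import math
import statistics


def cohens_d(fail, ok):
    """Cohen's d (fail minus ok) using a pooled sample standard deviation."""
    n_f, n_o = len(fail), len(ok)
    pooled_var = (
        (n_f - 1) * statistics.variance(fail) + (n_o - 1) * statistics.variance(ok)
    ) / (n_f + n_o - 2)
    return (statistics.mean(fail) - statistics.mean(ok)) / math.sqrt(pooled_var)


def holm_reject(p_values, alpha=0.05):
    """Holm step-down procedure: one reject flag per hypothesis, in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        # The k-th smallest p-value is compared against alpha / (m - k).
        if p_values[i] > alpha / (m - rank):
            break  # stop at the first failure; all larger p-values are retained
        reject[i] = True
    return reject


def screen(features, alpha=0.05, d_cut=0.8):
    """features maps id -> (p_value, d). Returns (Holm-significant ids, large-effect subset)."""
    ids = list(features)
    flags = holm_reject([features[i][0] for i in ids], alpha)
    significant = [i for i, keep in zip(ids, flags) if keep]
    large = [i for i in significant if abs(features[i][1]) > d_cut]
    return significant, large


# Toy inputs for illustration only; these are not the paper's data.
toy_d = cohens_d([2.0, 4.0, 6.0], [1.0, 2.0, 3.0])
toy = {"f0": (0.001, 1.9), "f1": (0.04, 0.9), "f2": (0.03, -0.3), "f3": (0.2, 0.1)}
significant, large = screen(toy)
print(toy_d, significant, large)
```

Holm's procedure controls the family-wise error rate across all 24,576 comparisons without assuming independence between features, which is why a feature such as the toy `f1` can have p < 0.05 and still be retained.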

Abstract

We report a small, reproducible audit of which sparse-autoencoder (SAE) features of GPT-2 small fire differently on failed versus successful trials of the Indirect Object Identification (IOI) task. On 300 prompts, GPT-2 small reaches 79.7% accuracy; 146 of the 24,576 features in the layer-8 residual-stream SAE release of Bloom (2024) clear a Holm-corrected significance threshold and 105 reach a large effect size (|Cohen's d| > 0.8). The strongest single correlate of failure -- feature 17,491, d=+2.93, Neuronpedia label 'cryptographic keys' -- is essentially silent except when the prompt's transferred object is 'the keys,' on which GPT-2 small fails 93.3% of the time vs. 7.5% on the other seven objects (Fisher exact p = 8.79 x 10^-33). We put this correlate through three controls that a mechanistic claim should pass. (i) A causal ablation: zeroing feature 17,491 in the residual stream…
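
The Fisher exact figure can be sanity-checked from the reported percentages. The sketch below assumes a 2x2 table that the abstract does not print: 45 'the keys' prompts with 42 failures, and 255 other prompts with 19 failures. These counts are inferred, not quoted; they are integers that reproduce 93.3%, 7.5%, and 79.7% overall accuracy on 300 prompts. The function name `fisher_exact_two_sided` is illustrative, and the two-sided rule used (sum every table no more probable than the observed one) is the common convention, assumed here rather than confirmed by the source.

```python
from fractions import Fraction
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1 = a + b
    col1 = a + c
    n = a + b + c + d
    denom = comb(n, row1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell with margins fixed.
        return Fraction(comb(col1, x) * comb(n - col1, row1 - x), denom)

    lo = max(0, row1 - (n - col1))
    hi = min(row1, col1)
    p_obs = prob(a)
    # Sum all tables that are at most as probable as the observed one.
    return float(sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs))


# Inferred counts (an assumption, see the text above): rows are object groups,
# columns are (failures, successes).
keys_fail, keys_ok = 42, 3
other_fail, other_ok = 19, 236

total = keys_fail + keys_ok + other_fail + other_ok
accuracy = 100 * (keys_ok + other_ok) / total
keys_fail_rate = 100 * keys_fail / (keys_fail + keys_ok)
other_fail_rate = 100 * other_fail / (other_fail + other_ok)
p_value = fisher_exact_two_sided(keys_fail, keys_ok, other_fail, other_ok)

print(round(accuracy, 1), round(keys_fail_rate, 1), round(other_fail_rate, 1))
print(p_value)
```

If the inferred counts are right, the printed p-value should fall at the same order of magnitude as the reported 8.79 x 10^-33. Exact integer arithmetic via `math.comb` and `Fraction` avoids the underflow that a naive floating-point product of factorials would hit at this scale.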
