From Correlation to Cause: A Five-Stage Methodology for Feature Analysis in Transformer Language Models
Caleb Munigety

TL;DR
This paper introduces a five-stage methodology for causal feature analysis in transformer language models and demonstrates it end-to-end on GPT-2 small, reporting results on feature causality, robustness, and deployment cost-efficiency.
Contribution
It presents a new end-to-end, five-stage framework for causal feature analysis in transformers: probe design, feature extraction, causal validation, robustness testing, and deployment integration.
Findings
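Activation patching recovers the canonical IOI circuit; layer-9 head 9 alone gives a recovery of +1.02.
A sparse autoencoder recovers per-name selective features with effect sizes of 30 to 50 activation units.
Ablating fifteen of these features leaves the model accurate on 98% of prompts, so they are only partially causal.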
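The "recovery" figure reported for activation patching is not defined in this summary. A common convention in activation-patching work, assumed here and not confirmed as the paper's own definition, normalizes the patched run's metric between the corrupted run (0) and the clean run (1). Under that convention a value slightly above 1 means the patched run scored marginally better than the clean run. The function name and the numbers below are hypothetical illustrations, not values from the paper.

```python
def patching_recovery(clean: float, corrupted: float, patched: float) -> float:
    """Normalized recovery of a metric (e.g. logit difference) after patching.

    0.0 means the patched run is no better than the corrupted run;
    1.0 means it matches the clean run. This normalization is an assumed
    convention, not necessarily the one used in the paper.
    """
    gap = clean - corrupted
    if gap == 0:
        raise ValueError("clean and corrupted metrics are equal; recovery is undefined")
    return (patched - corrupted) / gap


# Hypothetical logit differences, chosen only to show a recovery above 1.
recovery = patching_recovery(clean=2.0, corrupted=-2.0, patched=2.5)
print(f"recovery = {recovery:+.3f}")
```

Dividing by the clean-minus-corrupted gap makes scores comparable across prompts whose raw logit differences vary in scale, which is why this form is often preferred over reporting raw metric changes.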
Abstract
We propose a five-stage methodology for causal feature analysis in transformer language models (probe design, feature extraction, causal validation, robustness testing, and deployment integration) and demonstrate it end-to-end on GPT-2 small performing the Indirect Object Identification (IOI) task. Activation patching recovers the canonical IOI circuit (layer-9 head 9 alone gives recovery +1.02). A sparse autoencoder recovers per-name selective features with effect sizes of 30 to 50 activation units. Causal validation finds these features specifically but only partially causal: ablating fifteen of them leaves the model accurate on 98% of prompts. Two NLA-inspired evaluations strengthen this picture: the fifteen selective features explain only 31% of activation variance versus the SAE's 99.7%, and selectivity ratio anticorrelates with causal force (r = -0.56). Robustness testing under…
