From Correlation to Cause: A Five-Stage Methodology for Feature Analysis in Transformer Language Models

Caleb Munigety

arXiv:2605.22462·cs.CL·May 22, 2026

TL;DR

This paper introduces a five-stage methodology for causal feature analysis in transformer language models and demonstrates it end-to-end on GPT-2 small performing the Indirect Object Identification (IOI) task. The sparse-autoencoder features it recovers turn out to be specifically but only partially causal.

Contribution

It presents an end-to-end, five-stage framework for causal feature analysis in transformers: probe design, feature extraction, causal validation, robustness testing, and deployment integration.

Findings

01

Activation patching recovers the canonical IOI circuit; layer-9 head 9 alone gives a recovery of +1.02.

02

A sparse autoencoder recovers per-name selective features with effect sizes of 30 to 50 activation units.

03

Ablating fifteen of the selective features leaves the model accurate on 98% of prompts.

Abstract

We propose a five-stage methodology for causal feature analysis in transformer language models (probe design, feature extraction, causal validation, robustness testing, and deployment integration) and demonstrate it end-to-end on GPT-2 small performing the Indirect Object Identification (IOI) task. Activation patching recovers the canonical IOI circuit (layer-9 head 9 alone gives recovery +1.02). A sparse autoencoder recovers per-name selective features with effect sizes of 30 to 50 activation units. Causal validation finds these features specifically but only partially causal: ablating fifteen of them leaves the model accurate on 98% of prompts. Two NLA-inspired evaluations strengthen this picture: the fifteen selective features explain only 31% of activation variance versus the SAE's 99.7%, and selectivity ratio anticorrelates with causal force (r = -0.56). Robustness testing under…
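
The recovery and variance figures quoted in the abstract can be read against two commonly used metrics. The sketch below is illustrative only: this excerpt does not give the paper's exact definitions, so the normalized-recovery formula, the variance-explained formula, and the function names `patching_recovery` and `fraction_variance_explained` are assumptions rather than the author's implementation.

```python
from typing import Sequence


def patching_recovery(clean: float, corrupted: float, patched: float) -> float:
    """Normalized recovery of a score (e.g. a logit difference).

    0.0 means the patched run is no better than the corrupted run,
    1.0 means it matches the clean run; values above 1.0 are possible.
    Assumed definition, not taken from the paper.
    """
    gap = clean - corrupted
    if gap == 0:
        raise ValueError("clean and corrupted scores are equal; recovery is undefined")
    return (patched - corrupted) / gap


def fraction_variance_explained(
    activations: Sequence[Sequence[float]],
    reconstructions: Sequence[Sequence[float]],
) -> float:
    """1 - residual sum of squares / total sum of squares.

    The total is measured around the per-dimension mean of the activations.
    Assumed definition, not taken from the paper.
    """
    n = len(activations)
    dim = len(activations[0])
    means = [sum(row[j] for row in activations) / n for j in range(dim)]
    residual = sum(
        (a - r) ** 2
        for row_a, row_r in zip(activations, reconstructions)
        for a, r in zip(row_a, row_r)
    )
    total = sum((a - means[j]) ** 2 for row in activations for j, a in enumerate(row))
    if total == 0:
        raise ValueError("activations have zero variance")
    return 1.0 - residual / total


if __name__ == "__main__":
    # Made-up numbers, purely for illustration.
    print(patching_recovery(clean=3.0, corrupted=1.0, patched=3.5))
    acts = [[0.0, 2.0], [2.0, 0.0]]
    print(fraction_variance_explained(acts, acts))
```

Under this reading, a recovery slightly above 1 means that patching a single component restores marginally more than the full clean-versus-corrupted gap. The variance metric would be applied once to a reconstruction from the fifteen selective features and once to the full SAE reconstruction.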
