Mechanistic Interpretability of GPT-2: Lexical and Contextual Layers in Sentiment Analysis
Amartya Hatua

TL;DR
This study investigates how GPT-2 processes sentiment. It finds that lexical sentiment is detected in early layers (0-3), while contextual phenomena are integrated in late layers (8-11) through a unified mechanism, challenging the hypothesized two-stage model that places contextual integration in the middle layers.
Contribution
It provides causal, layer-wise evidence that sentiment processing in GPT-2 combines early lexical detection with late-stage contextual integration through a non-modular mechanism.
Findings
Early layers detect lexical sentiment independently of context.
All three hypotheses about mid-layer contextual integration (Middle Layer Concentration, Phenomenon Specificity, and Distributed Processing) are falsified.
Contextual phenomena are integrated mainly in late layers through a unified mechanism.
Abstract
We present a mechanistic interpretability study of GPT-2 that causally examines how sentiment information is processed across its transformer layers. Using systematic activation patching across all 12 layers, we test the hypothesized two-stage sentiment architecture comprising early lexical detection and mid-layer contextual integration. Our experiments confirm that early layers (0-3) act as lexical sentiment detectors, encoding stable, position-specific polarity signals that are largely independent of context. However, all three contextual integration hypotheses (Middle Layer Concentration, Phenomenon Specificity, and Distributed Processing) are falsified. Instead of mid-layer specialization, we find that contextual phenomena such as negation, sarcasm, and domain shifts are integrated primarily in late layers (8-11) through a unified, non-modular mechanism. These experimental findings…
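To make the layer-wise activation patching mentioned in the abstract concrete, here is a minimal sketch in pure Python. It is not the author's code and it does not load GPT-2: the model is a toy 12-layer residual network with random weights, and every name in it (`make_toy_model`, `run`, `patching_sweep`, `D_MODEL`) is a hypothetical chosen for illustration. It only shows the general shape of the procedure: cache each layer's output on a "clean" input, rerun a "corrupted" input while overwriting one layer's output with the cached clean one, and measure how much of the clean-versus-corrupted gap in the final score is recovered.

```python
import math
import random

N_LAYERS = 12   # GPT-2 small has 12 transformer blocks
D_MODEL = 8     # toy width, an assumption made only for this sketch


def make_toy_model(seed=0):
    """Build random weights for a toy residual network (NOT GPT-2)."""
    rng = random.Random(seed)
    layers = [
        [[rng.gauss(0, 0.5) for _ in range(D_MODEL)] for _ in range(D_MODEL)]
        for _ in range(N_LAYERS)
    ]
    readout = [rng.gauss(0, 1.0) for _ in range(D_MODEL)]
    return layers, readout


def layer_output(weights, h):
    """Contribution that one layer writes into the residual stream."""
    return [math.tanh(sum(w * x for w, x in zip(row, h))) for row in weights]


def run(model, x, patch=None):
    """Forward pass; patch=(layer_idx, vector) overrides that layer's output.

    Returns the scalar readout score and the list of per-layer outputs.
    """
    layers, readout = model
    h = list(x)
    cache = []
    for i, weights in enumerate(layers):
        out = layer_output(weights, h)
        if patch is not None and patch[0] == i:
            out = list(patch[1])          # overwrite with the cached activation
        cache.append(out)
        h = [a + b for a, b in zip(h, out)]   # residual update
    score = sum(r * v for r, v in zip(readout, h))
    return score, cache


def patching_sweep(model, clean_x, corrupt_x):
    """Patch each layer's clean output into the corrupted run, one at a time."""
    clean_score, clean_cache = run(model, clean_x)
    corrupt_score, _ = run(model, corrupt_x)
    gap = clean_score - corrupt_score
    recovered = []
    for i in range(N_LAYERS):
        patched_score, _ = run(model, corrupt_x, patch=(i, clean_cache[i]))
        # Fraction of the clean-vs-corrupted gap restored by this single patch.
        recovered.append((patched_score - corrupt_score) / gap if gap else 0.0)
    return clean_score, corrupt_score, recovered


model = make_toy_model(seed=0)
rng = random.Random(1)
clean_x = [rng.gauss(0, 1) for _ in range(D_MODEL)]     # stand-in for a clean prompt
corrupt_x = [rng.gauss(0, 1) for _ in range(D_MODEL)]   # stand-in for a corrupted prompt
clean_score, corrupt_score, recovered = patching_sweep(model, clean_x, corrupt_x)
for i, r in enumerate(recovered):
    print(f"layer {i:2d}: fraction of gap recovered = {r:+.3f}")
```

With a real model, the same sweep would be applied to GPT-2's 12 blocks using forward hooks, with the score replaced by a sentiment-relevant quantity such as a logit difference; the numbers printed by this toy carry no information about GPT-2 itself.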
Taxonomy
Topics: Sentiment Analysis and Opinion Mining · Emotion and Mood Recognition · Explainable Artificial Intelligence (XAI)
