Interpreting Negation in GPT-2: Layer- and Head-Level Causal Analysis

Abdullah Al Mofael; Lisa M. Kuhn; Ghassan Alkadi; Kuo-Pao Yang

arXiv:2603.12423 · cs.CL · March 16, 2026

Open Access

TL;DR

This paper investigates how GPT-2 processes negation internally by analyzing layer- and head-level causal mechanisms, revealing that negation sensitivity is concentrated in specific mid-layer attention heads and can be disrupted or restored through targeted interventions.

Contribution

It introduces a causal analysis framework for understanding negation in GPT-2, identifying key attention heads responsible for negation sensitivity and demonstrating their causal role.

Findings

01. Negation sensitivity is concentrated in specific mid-layer attention heads.

02. Ablation of these heads weakens negation sensitivity, which can be partially restored by activation rescue.

03. The causal patterns are consistent across different negation forms and datasets.

Abstract

Negation remains a persistent challenge for modern language models, often causing reversed meanings or factual errors. In this work, we conduct a causal analysis of how GPT-2 Small internally processes such linguistic transformations. We examine its hidden representations at both the layer and head level. Our analysis is based on a self-curated 12,000-pair dataset of matched affirmative and negated sentences, covering multiple linguistic templates and forms of negation. To quantify this behavior, we define a metric, the Negation Effect Score (NES), which measures the model's sensitivity in distinguishing between affirmative statements and their negations. We carried out two key interventions to probe causal structure. In activation patching, internal activations from affirmative sentences were inserted into their negated counterparts to see how meaning shifted. In ablation, specific…
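
The two interventions named in the abstract, activation patching and ablation, can be illustrated with a minimal toy sketch. This is not the authors' code and does not use GPT-2: the two-component residual "model", the `[content, negation_flag]` input encoding, and the `readout` and `toy_gap` names are all assumptions made for illustration. The gap computed here is a stand-in for intuition only, not the paper's Negation Effect Score, whose definition is not given in the text above.

```python
from typing import Callable, Dict, List, Optional, Set

Vector = List[float]
Component = Callable[[Vector], Vector]


def carrier(stream: Vector) -> Vector:
    # Hypothetical component that ignores the negation flag entirely.
    return [0.5 * stream[0], 0.0]


def negator(stream: Vector) -> Vector:
    # Hypothetical negation-sensitive component: flips the content
    # dimension whenever the negation flag (stream[1]) is set.
    return [-2.0 * stream[0] * stream[1], 0.0]


def run(components: List[Component], x: Vector,
        patch: Optional[Dict[int, Vector]] = None,
        ablate: Optional[Set[int]] = None,
        cache: Optional[Dict[int, Vector]] = None) -> Vector:
    """Run a toy residual model; each component adds its output to the stream."""
    stream = list(x)
    for i, component in enumerate(components):
        out = component(stream)
        if ablate is not None and i in ablate:
            out = [0.0] * len(out)      # ablation: zero the component's output
        if patch is not None and i in patch:
            out = list(patch[i])        # patching: substitute a cached activation
        if cache is not None:
            cache[i] = list(out)
        stream = [s + o for s, o in zip(stream, out)]
    return stream


def readout(stream: Vector) -> float:
    # Toy scalar output: the content dimension of the final stream.
    return stream[0]


model = [carrier, negator]
affirmative = [1.0, 0.0]   # [content, negation_flag]
negated = [1.0, 1.0]

# 1) Cache every component's output on the affirmative input.
aff_cache: Dict[int, Vector] = {}
aff_out = readout(run(model, affirmative, cache=aff_cache))
neg_out = readout(run(model, negated))
toy_gap = aff_out - neg_out  # toy sensitivity measure, NOT the paper's NES

# 2) Activation patching: rerun the negated input with one component's
#    output replaced by its cached affirmative activation.
patched = {i: readout(run(model, negated, patch={i: aff_cache[i]}))
           for i in range(len(model))}

# 3) Ablation: rerun the negated input with one component zeroed out.
ablated = {i: readout(run(model, negated, ablate={i}))
           for i in range(len(model))}

for i in range(len(model)):
    print(f"component {i}: patched={patched[i]}, ablated={ablated[i]}")
```

In this toy, patching or ablating the negation-sensitive component moves the negated output back to the affirmative value, while intervening on the other component changes nothing. That contrast is what localizes the effect. In a real transformer the same loop would run over layers or attention heads, typically through forward hooks.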

Taxonomy

Topics: Topic Modeling · Artificial Intelligence in Healthcare and Education · Domain Adaptation and Few-Shot Learning