Whether, Not Which: Mechanistic Interpretability Reveals Dissociable Affect Reception and Emotion Categorization in LLMs

Michael Keeman

arXiv:2603.22295 · cs.CL · March 25, 2026

Open Access

TL;DR

This study asks whether large language models respond to genuine emotional meaning or merely detect emotion keywords. Using keyword-free clinical vignettes and mechanistic interpretability methods, it finds two dissociable mechanisms: affect reception (whether emotion is present) and emotion categorization (which emotion it is).

Contribution

It introduces a novel clinical stimulus approach and mechanistic interpretability analysis to differentiate affect reception from emotion categorization in LLMs, challenging keyword-based explanations.

Findings

01. Affect reception operates with near-perfect accuracy across models.

02. Emotion categorization is partially keyword-dependent and improves with scale.

03. Representational analysis shows affect salience transfer between stimuli.

Abstract

Large language models appear to develop internal representations of emotion -- "emotion circuits," "emotion neurons," and structured emotional manifolds have been reported across multiple model families. But every study making these claims uses stimuli signalled by explicit emotion keywords, leaving a fundamental question unanswered: do these circuits detect genuine emotional meaning, or do they detect the word "devastated"? We present the first clinical validity test of emotion circuit claims using mechanistic interpretability methods grounded in clinical psychology -- clinical vignettes that evoke emotions through situational and behavioural cues alone, emotion keywords removed. Across six models (Llama-3.2-1B, Llama-3-8B, Gemma-2-9B; base and instruct variants), we apply four convergent mechanistic interpretability methods -- linear probing, causal activation patching, knockout…
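
Linear probing, the first of the methods the abstract names, can be illustrated with a toy example. The sketch below is not the paper's procedure: it fits a difference-of-means probe (a simpler stand-in for the logistic-regression probes often used) on synthetic vectors, not on real model activations. The "affect" direction, the extra "keyword" feature, the noise level, and every function name are assumptions made for illustration. It trains on keyword-bearing examples and scores keyword-free ones, which is the kind of transfer question the abstract raises.

```python
import random


def make_activations(n, dim, with_keyword, rng):
    """Synthetic stand-in for hidden states (assumption: not real model data)."""
    data = []
    for i in range(n):
        label = i % 2  # 1 = affect-laden, 0 = neutral
        sign = 1.0 if label == 1 else -1.0
        x = [rng.gauss(0.0, 0.3) for _ in range(dim)]
        x[0] += 2.0 * sign  # assumed "affect" direction shared by all stimuli
        if with_keyword:
            x[1] += 1.0 * sign  # assumed extra feature present only with keywords
        data.append((x, label))
    return data


def fit_mean_difference_probe(data):
    """Linear probe: w = mean(positive) - mean(negative), threshold at the midpoint."""
    dim = len(data[0][0])
    pos = [x for x, y in data if y == 1]
    neg = [x for x, y in data if y == 0]
    mean_pos = [sum(x[j] for x in pos) / len(pos) for j in range(dim)]
    mean_neg = [sum(x[j] for x in neg) / len(neg) for j in range(dim)]
    w = [p - q for p, q in zip(mean_pos, mean_neg)]
    b = -sum(wj * (p + q) / 2.0 for wj, p, q in zip(w, mean_pos, mean_neg))
    return w, b


def accuracy(probe, data):
    """Fraction of examples whose probe score falls on the correct side of zero."""
    w, b = probe
    correct = 0
    for x, y in data:
        score = sum(wj * xj for wj, xj in zip(w, x)) + b
        correct += int((score > 0) == (y == 1))
    return correct / len(data)


rng = random.Random(0)
keyword_train = make_activations(200, 8, with_keyword=True, rng=rng)
keyword_free_test = make_activations(100, 8, with_keyword=False, rng=rng)

probe = fit_mean_difference_probe(keyword_train)
transfer_acc = accuracy(probe, keyword_free_test)
print(f"transfer accuracy on keyword-free examples: {transfer_acc:.2f}")
```

In this toy setup the probe transfers because the assumed affect direction is present in both stimulus sets; if the signal lived only in the keyword feature, the same probe would drop to chance on the keyword-free set.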

Taxonomy

Topics: Emotion and Mood Recognition · Neurobiology of Language and Bilingualism · Face Recognition and Perception