Whether, Not Which: Mechanistic Interpretability Reveals Dissociable Affect Reception and Emotion Categorization in LLMs
Michael Keeman

TL;DR
This study tests whether the emotion representations reported in large language models track emotional meaning or merely emotion keywords. Using keyword-free clinical vignettes and mechanistic interpretability methods, it finds two dissociable mechanisms: affect reception and emotion categorization.
Contribution
The paper introduces clinical vignettes that evoke emotion through situational and behavioural cues rather than emotion keywords, and uses mechanistic interpretability analysis to separate affect reception from emotion categorization in LLMs, challenging keyword-based explanations.
Findings
Affect reception operates with near-perfect accuracy across models.
Emotion categorization is partially keyword-dependent and improves with scale.
Representational analysis shows affect salience transfer between stimuli.
Abstract
Large language models appear to develop internal representations of emotion -- "emotion circuits," "emotion neurons," and structured emotional manifolds have been reported across multiple model families. But every study making these claims uses stimuli signalled by explicit emotion keywords, leaving a fundamental question unanswered: do these circuits detect genuine emotional meaning, or do they detect the word "devastated"? We present the first clinical validity test of emotion circuit claims using mechanistic interpretability methods grounded in clinical psychology -- clinical vignettes that evoke emotions through situational and behavioural cues alone, emotion keywords removed. Across six models (Llama-3.2-1B, Llama-3-8B, Gemma-2-9B; base and instruct variants), we apply four convergent mechanistic interpretability methods -- linear probing, causal activation patching, knockout…
Taxonomy
Topics: Emotion and Mood Recognition · Neurobiology of Language and Bilingualism · Face Recognition and Perception
