From Syntax to Emotion: A Mechanistic Analysis of Emotion Inference in LLMs
Bangzhao Shu, Arinjay Singh, Mai ElSherief

TL;DR
This paper systematically analyzes how large language models internally recognize emotions, revealing a three-phase information flow and proposing a causal feature steering method to enhance emotion recognition performance.
Contribution
The paper introduces an SAE-based mechanistic analysis of emotion inference in LLMs, identifying a three-phase information flow and a small set of causally influential features, and presents a novel, interpretable feature steering method to improve emotion recognition accuracy.
Findings
Emotion-related features emerge only in the final phase of a three-phase information flow across layers.
Disgust is more weakly and diffusely represented than other emotions.
The proposed causal feature steering improves emotion recognition across models and datasets.
Abstract
Large language models (LLMs) are increasingly used in emotionally sensitive human-AI applications, yet little is known about how emotion recognition is internally represented. In this work, we investigate the internal mechanisms of emotion recognition in LLMs using sparse autoencoders (SAEs). By analyzing sparse feature activations across layers, we identify a consistent three-phase information flow, in which emotion-related features emerge only in the final phase. We further show that emotion representations comprise both shared features across emotions and emotion-specific features. Using phase-stratified causal tracing, we identify a small set of features that strongly influence emotion predictions, and show that both their number and causal impact vary across emotions; in particular, Disgust is more weakly and diffusely represented than other emotions. Finally, we propose an…
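For readers unfamiliar with SAE-based steering, here is a minimal generic sketch of the idea: encode a hidden activation into sparse features, then nudge the activation along one feature's decoder direction. It is not the authors' method, whose details are not given in this summary. The toy dimensions, the random weights, the tied encoder/decoder, and the names `sae_encode`, `steer`, `feature_idx`, and `alpha` are all assumptions made for illustration.

```python
import numpy as np

def sae_encode(x, W_enc, b_enc):
    """ReLU encoder of a basic sparse autoencoder: f = ReLU(x @ W_enc + b_enc)."""
    return np.maximum(x @ W_enc + b_enc, 0.0)

def steer(x, W_dec, feature_idx, alpha):
    """Shift activation x by alpha along the unit-norm decoder direction of one feature."""
    direction = W_dec[feature_idx]
    direction = direction / np.linalg.norm(direction)
    return x + alpha * direction

# Toy setup (hypothetical sizes and random weights, not taken from the paper).
rng = np.random.default_rng(0)
d_model, n_features = 16, 64
W_dec = rng.normal(size=(n_features, d_model))  # one decoder row per feature
W_enc = W_dec.T                                 # tied weights, an illustrative assumption
b_enc = np.zeros(n_features)

x = rng.normal(size=d_model)                    # stand-in for a hidden activation
feature_idx, alpha = 3, 4.0                     # hypothetical feature and steering strength

f_before = sae_encode(x, W_enc, b_enc)
steered = steer(x, W_dec, feature_idx, alpha)
f_after = sae_encode(steered, W_enc, b_enc)

print("feature activation before:", f_before[feature_idx])
print("feature activation after: ", f_after[feature_idx])
```

With tied weights, the pre-activation of the steered feature rises by exactly `alpha` times the norm of its decoder row, so its activation cannot decrease. In a real model the steered activation would be written back into the forward pass at the chosen layer, and the effect on the emotion prediction would have to be measured empirically.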
