Extracting and Steering Emotion Representations in Small Language Models: A Methodological Comparison
Jihoon Jeong

TL;DR
This paper compares methods for extracting and steering emotion representations in small language models, and examines the structure of those representations, the effect of architecture, and cross-lingual entanglement.
Contribution
The paper provides the first comparative analysis of emotion extraction methods in small language models, finds that generation-based extraction separates emotions better than comprehension-based extraction, and details where emotion representations localize and how they can be steered.
Findings
Generation-based extraction yields better emotion separation than comprehension-based extraction.
Emotion representations localize at middle transformer layers (~50% depth).
Steering experiments confirm causal effects and reveal three operational regimes.
Abstract
Small language models (SLMs) in the 100M-10B parameter range increasingly power production systems, yet whether they possess the internal emotion representations recently discovered in frontier models remains unknown. We present the first comparative analysis of emotion vector extraction methods for SLMs, evaluating 9 models across 5 architectural families (GPT-2, Gemma, Qwen, Llama, Mistral) using 20 emotions and two extraction methods (generation-based and comprehension-based). Generation-based extraction produces statistically superior emotion separation (Mann-Whitney p = 0.007; Cohen's d = -107.5), with the advantage modulated by instruction tuning and architecture. Emotion representations localize at middle transformer layers (~50% depth), following a U-shaped curve that is architecture-invariant from 124M to 3B parameters. We validate these findings against representational…
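The abstract reports a Mann-Whitney test and a Cohen's d for the comparison between the two extraction methods. As a rough illustration of how such statistics are computed, here is a minimal sketch using only the Python standard library. The score lists are made-up placeholders, not data from the paper, and the function names, the pooled-variance form of Cohen's d, and the normal approximation for the p-value are assumptions chosen for this example; the paper's exact separation metric and test configuration are not stated in the text above.

```python
import math
from statistics import mean, variance


def cohens_d(a, b):
    """Cohen's d for (a minus b) using a pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)


def mann_whitney_u(a, b):
    """U statistic for sample a: pairs with a_i > b_j count 1, ties count 0.5."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)


def mann_whitney_p(a, b):
    """Two-sided p-value via the normal approximation.

    No tie correction and no continuity correction are applied, so this is
    only a rough approximation for small samples.
    """
    na, nb = len(a), len(b)
    u = mann_whitney_u(a, b)
    mu = na * nb / 2
    sigma = math.sqrt(na * nb * (na + nb + 1) / 12)
    z = (u - mu) / sigma
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical per-model separation scores (placeholders, NOT the paper's data).
gen_scores = [0.10, 0.12, 0.11, 0.13]
comp_scores = [0.90, 0.92, 0.91, 0.93]

d = cohens_d(gen_scores, comp_scores)
u = mann_whitney_u(gen_scores, comp_scores)
p = mann_whitney_p(gen_scores, comp_scores)
print(f"d = {d:.2f}, U = {u}, p = {p:.4f}")
```

Cohen's d divides the gap between group means by the pooled standard deviation, so its magnitude grows very large whenever the within-group spread is tiny relative to that gap, as in the placeholder data here.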
