Neuron-Level Emotion Control in Speech-Generative Large Audio-Language Models
Xiutian Zhao, Ismail Rasim Ulgen, Philipp Koehn, Björn Schuller, Berrak Sisman

TL;DR
This paper introduces a neuron-level method for controlling emotion in speech-generative models, enabling precise, training-free emotion steering that generalizes across speakers and maintains content fidelity.
Contribution
To the authors' knowledge, this is the first work to identify and use emotion-sensitive neurons in large audio-language models for causal, training-free emotion control at inference time.
Findings
Emotion-sensitive neurons can be causally manipulated for emotion control.
Interventions improve emotion accuracy across unseen speakers.
Control depends on selector design and intervention parameters.
Abstract
Large audio-language models (LALMs) can produce expressive speech, yet reliable emotion control remains elusive: conversions often miss the target affect and may degrade linguistic fidelity through refusals, hallucinations, or paraphrase. We present, to our knowledge, the first neuron-level study of emotion control in speech-generative LALMs and demonstrate that compact emotion-sensitive neurons (ESNs) are causally actionable, enabling training-free emotion steering at inference time. ESNs are identified via success-filtered activation aggregation enforcing both emotion realization and content preservation. Across three LALMs (Qwen2.5-Omni-7B, MiniCPM-o 4.5, Kimi-Audio), ESN interventions yield emotion-specific gains that generalize to unseen speakers and are supported by automatic and human evaluation. Controllability depends on selector design, mask sparsity, filtering, and…
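The two steps the abstract names, success-filtered aggregation and inference-time intervention, can be illustrated with a small sketch. Everything below is an assumption for illustration: the function names `select_esn_mask` and `steer`, the plain mean-activation selector, the top-k mask, and the multiplicative scaling are hypothetical choices, not the paper's actual selector or intervention.

```python
from typing import List, Sequence


def select_esn_mask(
    activations: Sequence[Sequence[float]],
    emotion_ok: Sequence[bool],
    content_ok: Sequence[bool],
    top_k: int,
) -> List[int]:
    """Pick top_k neuron indices by mean activation over 'successful' samples.

    A sample counts as successful only if it passed BOTH filters:
    the target emotion was realized and the content was preserved.
    (Hypothetical selector; the paper's scoring rule may differ.)
    """
    kept = [a for a, e, c in zip(activations, emotion_ok, content_ok) if e and c]
    if not kept:
        raise ValueError("no sample passed both filters")
    n_neurons = len(kept[0])
    mean = [sum(row[j] for row in kept) / len(kept) for j in range(n_neurons)]
    ranked = sorted(range(n_neurons), key=lambda j: mean[j], reverse=True)
    return sorted(ranked[:top_k])


def steer(hidden: Sequence[float], mask: Sequence[int], scale: float) -> List[float]:
    """Scale only the masked neurons; leave all others unchanged.

    (Assumed intervention: simple multiplicative amplification.)
    """
    chosen = set(mask)
    return [h * scale if j in chosen else h for j, h in enumerate(hidden)]


# Toy data: 3 samples x 4 neurons; the third sample fails the content filter.
acts = [[0.1, 0.9, 0.2, 0.8], [0.0, 1.0, 0.1, 0.7], [0.9, 0.0, 0.9, 0.0]]
mask = select_esn_mask(acts, [True, True, True], [True, True, False], top_k=2)
steered = steer([1.0, 1.0, 1.0, 1.0], mask, scale=2.0)
```

In this toy setup, `top_k` plays the role of mask sparsity and `scale` the role of an intervention parameter, the kinds of settings the abstract says controllability depends on.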
Taxonomy
Topics: Emotion and Mood Recognition · Neuroscience and Music Perception · Music and Audio Processing
