EmoShift: Lightweight Activation Steering for Enhanced Emotion-Aware Speech Synthesis
Li Zhou, Hao Jiang, Junjie Li, Tianrui Wang, Haizhou Li

TL;DR
EmoShift introduces a lightweight activation-steering method for emotion-aware speech synthesis that controls emotional expression with about 10M trainable parameters and outperforms both zero-shot and fully fine-tuned baselines.
Contribution
The paper proposes EmoShift, a novel low-parameter framework with an EmoSteer layer that effectively models emotion-specific latent features in speech synthesis.
Findings
Outperforms zero-shot and fully fine-tuned baselines in objective and subjective evaluations
Trains only 10M parameters, less than 1/30 of the parameters updated by full fine-tuning
Enhances emotional expressiveness while maintaining naturalness and speaker similarity
Abstract
Achieving precise and controllable emotional expression is crucial for producing natural and context-appropriate speech in text-to-speech (TTS) synthesis. However, many emotion-aware TTS systems, including large language model (LLM)-based designs, rely on scaling fixed emotion embeddings or external guidance, limiting their ability to model emotion-specific latent characteristics. To address this gap, we present EmoShift, a lightweight activation-steering framework incorporating an EmoSteer layer, which learns a steering vector for each target emotion in the output embedding space to capture its latent offset and maintain stable, appropriate expression across utterances and categories. With only 10M trainable parameters, less than 1/30 of full fine-tuning, EmoShift outperforms zero-shot and fully fine-tuned baselines in objective and subjective evaluations, enhancing emotional…
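To make the steering idea in the abstract concrete, the following is a minimal sketch of an additive per-emotion offset, not the authors' implementation. The class name EmoSteerSketch, the scale factor, the zero initialization, and the list-of-frames input format are all assumptions made for illustration. The abstract reports about 10M trainable parameters, so the real EmoSteer layer is far larger than this bare lookup table; the sketch only shows what "learning a steering vector for each target emotion in the output embedding space" could look like at inference time.

```python
class EmoSteerSketch:
    """Hypothetical steering layer: one offset vector per emotion.

    Illustrative only; names, shapes, and the scale factor are assumptions
    and are not taken from the paper.
    """

    def __init__(self, dim, emotions, scale=1.0):
        self.dim = dim
        self.scale = scale
        # Zero initialization (an assumption) makes the layer an identity
        # map before any training has happened.
        self.vectors = {emotion: [0.0] * dim for emotion in emotions}

    def __call__(self, hidden, emotion):
        """Shift every frame embedding by the chosen emotion's vector.

        hidden: list of frames, each a list of `dim` floats.
        """
        steer = self.vectors[emotion]
        return [
            [h + self.scale * s for h, s in zip(frame, steer)]
            for frame in hidden
        ]

    def num_parameters(self):
        # Only the steering vectors would be trained; the backbone TTS
        # model stays frozen in this sketch.
        return self.dim * len(self.vectors)


# Usage: pretend training produced an offset for "happy".
layer = EmoSteerSketch(dim=3, emotions=["happy", "sad"])
layer.vectors["happy"] = [1.0, 0.0, -1.0]
steered = layer([[0.0, 0.0, 0.0], [1.0, 2.0, 3.0]], "happy")
```

The same offset is added to every frame, which is one simple way to keep the shift consistent across an utterance; whether the paper applies the vector per frame, per token, or through an additional projection is not stated in the text above.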
Taxonomy
Topics: Speech Recognition and Synthesis · Mental Health via Writing · Emotion and Mood Recognition
