TL;DR
EmoSteer-TTS introduces a training-free activation-steering method for fine-grained, continuous emotion control in TTS, enabling more natural and flexible emotional speech synthesis without extensive training datasets.
Contribution
It presents the first training-free approach for continuous emotion control in TTS by modifying internal activations, applicable to various pretrained models.
Findings
Enables fine-grained emotion manipulation in TTS
Outperforms state-of-the-art methods in emotion control
Works seamlessly with multiple pretrained TTS models
Abstract
Text-to-speech (TTS) has shown great progress in recent years. However, most existing TTS systems offer only coarse and rigid emotion control, typically via discrete emotion labels or a carefully crafted and detailed emotional text prompt, making fine-grained emotion manipulation either inaccessible or unstable. These models also require extensive, high-quality datasets for training. To address these limitations, we propose EmoSteer-TTS, a novel training-free approach, to achieve fine-grained speech emotion control (conversion, interpolation, erasure) by activation steering. We first empirically observe that modifying a subset of the internal activations within a flow matching-based TTS model can effectively alter the emotional tone of synthesized speech. Building on this insight, we then develop a training-free and efficient algorithm, including activation extraction, emotional token…
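To make the idea of activation steering concrete, here is a minimal NumPy sketch of the general technique. It is not the paper's algorithm, whose details are truncated above. It assumes a difference-of-means steering direction, a simple top-k token selection by activation difference, additive steering with strength `alpha`, and projection removal for erasure. All function names, array shapes, and the way activations are captured are hypothetical.

```python
import numpy as np

def steering_vector(emotional_acts, neutral_acts, k=200):
    """Unit steering direction from the k tokens that differ most.

    Both inputs have shape (n_samples, n_tokens, dim): activations captured
    at one layer (an assumed setup, not the paper's exact procedure).
    """
    diff = emotional_acts.mean(axis=0) - neutral_acts.mean(axis=0)  # (n_tokens, dim)
    scores = np.linalg.norm(diff, axis=1)        # per-token difference magnitude
    k = min(k, scores.shape[0])
    top = np.argsort(scores)[-k:]                # indices of the k largest scores
    mask = np.zeros(scores.shape[0], dtype=bool)
    mask[top] = True
    vec = diff[mask].mean(axis=0)                # average direction over selected tokens
    return vec / (np.linalg.norm(vec) + 1e-8), mask

def steer(h, v, alpha):
    """Conversion: push activations h (n_tokens, dim) along v by strength alpha."""
    return h + alpha * v

def erase(h, v):
    """Erasure: remove the component of each activation that lies along v."""
    return h - np.outer(h @ v, v)

def interpolate(v_a, v_b, t):
    """Interpolation: blend two unit steering directions, t in [0, 1]."""
    mixed = (1.0 - t) * v_a + t * v_b
    return mixed / (np.linalg.norm(mixed) + 1e-8)
```

In this sketch, `alpha` is the continuous control knob: larger values push the activations further along the emotion direction, and `interpolate` blends two emotions before steering.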
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
- Works across multiple flow-matching backbones; no fine-tuning required.
- Empirical analyses give actionable guidance: k≈200 works well; multi-layer (spaced) steering outperforms shallow-only; steering across all flow steps is strongest.
- Maintains performance on EMNS/SeedTTS despite steering vectors built from other corpora.
- Low WER and high speaker similarity versus strong flow-matching baselines.
- The paper prefers a large alpha but lacks a clear tradeoff curve (alpha vs. WER/N-MOS/E-SIM) and a recommended operating range.
- Emotion scores use emotion2vec/SenseVoice; although both are reported, objective metrics can be biased toward specific embeddings.
* Originality
- Introduces a training-free, activation-steering paradigm for emotion control in TTS, a clear departure from the prevailing label- or description-conditioned methods that require large-scale training and supervision.
- Creatively adapts activation steering (previously shown effective in LLMs and T2I diffusion) to flow-matching, DiT-based TTS models, demonstrating cross-domain transfer of a control technique to speech generation.
- Proposes a principled pipeline to discover em
The approach, while training-free at inference, still relies on a curated pool of high-quality emotional speech to build steering vectors, which weakens the claim of being data-free and raises questions about scalability. Please quantify sample complexity (how many and what quality of references are needed), test cross-lingual transfer (build in one language, apply to another), and assess robustness to noise, reverberation, and device/domain mismatch. Token selection and several evaluations dep
The idea of training-free, fine-grained emotional control is interesting for advancing expressive TTS systems. If it can be well explained and validated, the proposed approach has the potential to reduce the reliance on large paired emotional datasets, which remains a challenge in emotional TTS.
This paper can be further improved by addressing its limitations, including insufficient literature coverage, unclear method design, limited novelty, weak results, and unclear reproducibility details. The methodology section is not clearly written. I suggest improving it by explaining the underlying motivation and the rationale behind the design choices. Additionally, please clarify what each equation represents and how it contributes to the overall approach. The related work section would ben
