EmoSteer-TTS: Fine-Grained and Training-Free Emotion-Controllable Text-to-Speech via Activation Steering

Tianxin Xie; Shan Yang; Chenxing Li; Dong Yu; Li Liu

arXiv:2508.03543·cs.SD·October 28, 2025

TL;DR

EmoSteer-TTS introduces a training-free activation-steering method for fine-grained, continuous emotion control in TTS, enabling more natural and flexible emotional speech synthesis without extensive training datasets.

Contribution

It presents the first training-free approach for continuous emotion control in TTS by modifying internal activations, applicable to various pretrained models.

Findings

01 Enables fine-grained emotion manipulation in TTS

02 Outperforms state-of-the-art methods in emotion control

03 Works seamlessly with multiple pretrained TTS models

Abstract

Text-to-speech (TTS) has shown great progress in recent years. However, most existing TTS systems offer only coarse and rigid emotion control, typically via discrete emotion labels or a carefully crafted and detailed emotional text prompt, making fine-grained emotion manipulation either inaccessible or unstable. These models also require extensive, high-quality datasets for training. To address these limitations, we propose EmoSteer-TTS, a novel training-free approach, to achieve fine-grained speech emotion control (conversion, interpolation, erasure) by activation steering. We first empirically observe that modifying a subset of the internal activations within a flow matching-based TTS model can effectively alter the emotional tone of synthesized speech. Building on this insight, we then develop a training-free and efficient algorithm, including activation extraction, emotional token…
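The general idea of activation steering described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' algorithm: the abstract is truncated before the method details, so everything here is an assumption made for illustration. The function names (`steering_vector`, `steer`, `erase`), the difference-of-means construction, the top-k selection over activation dimensions, and the toy random "activations" are all hypothetical, and the `k` used here is not necessarily the same quantity as the k mentioned in the reviews.

```python
import numpy as np

def steering_vector(emotional_acts, neutral_acts, k):
    """Hypothetical construction: difference of mean activations,
    keeping only the k largest-magnitude dimensions, scaled to unit norm."""
    diff = emotional_acts.mean(axis=0) - neutral_acts.mean(axis=0)
    keep = np.argsort(np.abs(diff))[-k:]   # indices of the k largest |diff|
    sparse = np.zeros_like(diff)
    sparse[keep] = diff[keep]
    return sparse / np.linalg.norm(sparse)

def steer(acts, v, alpha):
    """Shift every activation row by alpha along v.
    Larger alpha means a stronger push toward the target emotion."""
    return acts + alpha * v

def erase(acts, v):
    """Remove the component of every activation row along the unit vector v."""
    return acts - np.outer(acts @ v, v)

# Toy data standing in for activations from one layer (assumption: 16 dims).
rng = np.random.default_rng(0)
dim = 16
offset = np.zeros(dim)
offset[:4] = 3.0                            # "emotion" lives in 4 dimensions
neutral = rng.normal(size=(50, dim))
emotional = rng.normal(size=(50, dim)) + offset

v = steering_vector(emotional, neutral, k=4)
x = rng.normal(size=(5, dim))               # activations to be modified
steered = steer(x, v, alpha=2.0)            # conversion / interpolation
erased = erase(x, v)                        # erasure
```

In this sketch, varying `alpha` continuously gives a simple form of interpolation, while `erase` projects the steering direction out entirely. In a real model these operations would be applied to hidden states at chosen layers and flow steps, which the sketch does not attempt to reproduce.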

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 4 · Confidence 4

Strengths

- Works across multiple flow-matching backbones; no fine-tuning required.
- Empirical analyses give actionable guidance: k≈200 works well; multi-layer (spaced) steering outperforms shallow-only; steering across all flow steps is strongest.
- Maintains performance on EMNS/SeedTTS despite steering vectors built from other corpora.
- Low WER and high speaker similarity versus strong flow-matching baselines.

Weaknesses

- The paper prefers a large alpha but lacks a clear tradeoff curve (alpha vs. WER/N-MOS/E-SIM) and recommended operating range.
- Emotion scores use emotion2vec/SenseVoice; although both are reported, objective metrics can bias toward specific embeddings.

Reviewer 02 · Rating 6 · Confidence 2

Strengths

* Originality
  - Introduces a training-free, activation-steering paradigm for emotion control in TTS, a clear departure from the prevailing label- or description-conditioned methods that require large-scale training and supervision.
  - Creatively adapts activation steering, previously shown effective in LLMs and T2I diffusion, to flow-matching, DiT-based TTS models, demonstrating cross-domain transfer of a control technique to speech generation.
  - Proposes a principled pipeline to discover em

Weaknesses

The approach, while training-free at inference, still relies on a curated pool of high-quality emotional speech to build steering vectors, which weakens the claim of being data-free and raises questions about scalability. Please quantify sample complexity (how many and what quality of references are needed), test cross-lingual transfer (build in one language, apply to another), and assess robustness to noise, reverberation, and device/domain mismatch. Token selection and several evaluations dep

Reviewer 03 · Rating 2 · Confidence 5

Strengths

The idea of training-free, fine-grained emotional control is interesting for advancing expressive TTS systems. If it can be well explained and validated, the proposed approach has the potential to reduce the reliance on large paired emotional datasets, which remains a challenge in emotional TTS.

Weaknesses

This paper can be further improved by addressing its limitations, including insufficient literature coverage, unclear method design, limited novelty, weak results, and unclear reproducibility details. The methodology section is not clearly written. I suggest improving it by explaining the underlying motivation and the rationale behind the design choices. Additionally, please clarify what each equation represents and how it contributes to the overall approach. The related work section would ben
