TED-TTS: Training-Free Intra-Utterance Emotion and Duration Control for Text-to-Speech Synthesis
Qifan Liang, Yuansen Liu, Ruixin Wei, Nan Lu, Junchuan Zhao, Ye Wang

TL;DR
TED-TTS introduces a training-free, controllable TTS framework that enables fine-grained intra-utterance emotion and duration control using a novel segment-aware conditioning strategy, without requiring additional training or manual prompts.
Contribution
TED-TTS proposes a training-free approach to intra-utterance emotion and duration control in TTS, combining segment-aware conditioning strategies with automatic prompt construction from a large annotated dataset.
Findings
Achieves state-of-the-art intra-utterance consistency in multi-emotion and duration control.
Maintains baseline speech quality while enabling fine-grained control.
Demonstrates effectiveness through extensive experiments.
Abstract
While controllable Text-to-Speech (TTS) has achieved notable progress, most existing methods remain limited to inter-utterance-level control, making fine-grained intra-utterance expression challenging due to their reliance on non-public datasets or complex multi-stage training. In this paper, we propose TED-TTS, a training-free controllable framework for pretrained zero-shot TTS to enable intra-utterance emotion and duration expression. Specifically, we propose a segment-aware emotion conditioning strategy that combines causal masking with monotonic stream alignment filtering to isolate emotion conditioning and schedule mask transitions, enabling smooth intra-utterance emotion shifts while preserving global semantic coherence. Based on this, we further propose a segment-aware duration steering strategy to combine local duration embedding steering with global EOS logit modulation,…
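The two ideas named in the abstract, restricting each segment's emotion conditioning and modulating the EOS logit to steer overall length, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the hard one-prompt-per-segment mask, the linear bias schedule, and the `strength` parameter are all assumptions made for illustration, and the sketch omits the monotonic stream alignment filtering and the local duration embedding steering that the abstract mentions.

```python
def segment_prompt_mask(segment_lengths):
    """Build mask[t][s]: True if generated token t may attend to emotion prompt s.

    Assumption: each segment sees only its own emotion prompt (a hard mask).
    """
    num_segments = len(segment_lengths)
    mask = []
    for seg_idx, length in enumerate(segment_lengths):
        for _ in range(length):
            mask.append([s == seg_idx for s in range(num_segments)])
    return mask


def modulate_eos_logit(logits, eos_id, step, target_len, strength=5.0):
    """Return a copy of logits with the EOS logit biased toward target_len.

    Assumption: a linear schedule. The bias is negative before the target
    length (discouraging early stopping), zero at the target, and positive
    after it (encouraging termination).
    """
    progress = step / max(target_len, 1)
    bias = strength * (progress - 1.0)
    out = list(logits)  # leave the caller's logits untouched
    out[eos_id] += bias
    return out


# Two segments of 2 and 3 tokens, each tied to its own emotion prompt.
mask = segment_prompt_mask([2, 3])

# Hypothetical 3-way vocabulary where index 2 plays the role of EOS.
early = modulate_eos_logit([0.0, 1.0, 2.0], eos_id=2, step=0, target_len=10)
late = modulate_eos_logit([0.0, 1.0, 2.0], eos_id=2, step=20, target_len=10)
```

In a real decoder the mask would be applied inside attention and the bias inside the sampling loop at every step; here both are plain lists so the behavior is easy to inspect.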
Taxonomy
Topics: Emotion and Mood Recognition · Speech Recognition and Synthesis · Topic Modeling
