TED-TTS: Training-Free Intra-Utterance Emotion and Duration Control for Text-to-Speech Synthesis

Qifan Liang; Yuansen Liu; Ruixin Wei; Nan Lu; Junchuan Zhao; Ye Wang

arXiv:2601.03170 · cs.SD · May 19, 2026

TL;DR

TED-TTS is a training-free controllable TTS framework that uses a segment-aware conditioning strategy to give fine-grained control over emotion and duration within a single utterance, without manual prompts.

Contribution

The paper proposes a training-free approach to intra-utterance emotion and duration control in TTS that relies on segment-aware strategies and on prompts constructed automatically from a large annotated dataset.

Findings

01 Achieves state-of-the-art intra-utterance consistency in multi-emotion and duration control.

02 Maintains baseline speech quality while enabling fine-grained control.

03 Demonstrates effectiveness through extensive experiments.

Abstract

While controllable Text-to-Speech (TTS) has achieved notable progress, most existing methods remain limited to inter-utterance-level control, making fine-grained intra-utterance expression challenging due to their reliance on non-public datasets or complex multi-stage training. In this paper, we propose TED-TTS, a training-free controllable framework for pretrained zero-shot TTS to enable intra-utterance emotion and duration expression. Specifically, we propose a segment-aware emotion conditioning strategy that combines causal masking with monotonic stream alignment filtering to isolate emotion conditioning and schedule mask transitions, enabling smooth intra-utterance emotion shifts while preserving global semantic coherence. Based on this, we further propose a segment-aware duration steering strategy to combine local duration embedding steering with global EOS logit modulation,…
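
To make the idea of "global EOS logit modulation" from the abstract concrete, here is a minimal Python sketch. It is not the authors' implementation: the function name `modulate_eos_logit`, the linear bias schedule, the `strength` parameter, and the toy three-token vocabulary are all assumptions chosen for illustration. It only shows the general principle that shifting the end-of-sequence logit can discourage stopping before a target length and encourage it afterwards.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def modulate_eos_logit(logits, eos_id, step, target_len, strength=0.5):
    """Hypothetical helper: bias the EOS logit by distance to a target length.

    Before `target_len` the bias is negative (stopping is suppressed);
    after it the bias is positive (stopping is encouraged). The linear
    schedule is an assumption, not something stated in the paper.
    """
    out = list(logits)  # copy so the caller's logits are not mutated
    out[eos_id] += strength * (step - target_len)
    return out


# Toy decoder output over a 3-token vocabulary; index 2 plays the role of EOS.
EOS_ID = 2
logits = [1.0, 0.5, 0.0]

base = softmax(logits)
early = softmax(modulate_eos_logit(logits, EOS_ID, step=10, target_len=20))
late = softmax(modulate_eos_logit(logits, EOS_ID, step=30, target_len=20))

print(f"P(EOS) unmodified:        {base[EOS_ID]:.4f}")
print(f"P(EOS) 10 steps too early: {early[EOS_ID]:.4f}")
print(f"P(EOS) 10 steps too late:  {late[EOS_ID]:.4f}")
```

In this sketch the non-EOS logits are left untouched, so the relative ranking of ordinary tokens is preserved and only the stopping decision is steered. How TED-TTS combines such a global bias with its local duration embedding steering is not shown here.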

Code & Models

Datasets

Chanson-0803/MED-TTS
dataset · 189 downloads

Taxonomy

Topics: Emotion and Mood Recognition · Speech Recognition and Synthesis · Topic Modeling