TTS-CtrlNet: Time varying emotion aligned text-to-speech generation with ControlNet

Jaeseok Jeong; Yuna Lee; Mingi Kwon; Youngjung Uh

arXiv:2507.04349·cs.SD·July 8, 2025

TL;DR

TTS-CtrlNet is a ControlNet-based method for fine-grained, time-varying emotion control in text-to-speech synthesis. It adds this control to an existing pretrained model without full fine-tuning and reports state-of-the-art emotion similarity scores.

Contribution

The authors present TTS-CtrlNet as the first ControlNet-based approach for flow-matching TTS, giving scalable, controllable, time-varying emotion synthesis while preserving the capabilities of the original model.

Findings

01 Effective addition of emotion control to existing TTS models

02 Achieves state-of-the-art emotion similarity scores

03 Maintains naturalness and zero-shot voice cloning

Abstract

Recent advances in text-to-speech (TTS) have enabled natural speech synthesis, but fine-grained, time-varying emotion control remains challenging. Existing methods often allow only utterance-level control and require full model fine-tuning with a large emotion speech dataset, which can degrade performance. Inspired by ControlNet (Zhang et al., 2023), which adds conditional control to an existing model, we propose the first ControlNet-based approach for controllable flow-matching TTS (TTS-CtrlNet), which freezes the original model and introduces a trainable copy of it to process additional conditions. We show that TTS-CtrlNet can boost a large pretrained TTS model by adding intuitive, scalable, and time-varying emotion control while inheriting the abilities of the original model (e.g., zero-shot voice cloning and naturalness). Furthermore, we provide practical recipes for adding emotion…
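The freeze-and-copy idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the class names (`FrozenBlock`, `ControlBranch`), the single linear layer standing in for a TTS block, the way the emotion signal is added to the input, and the zero-initialized output projection (borrowed from the zero convolutions of the original ControlNet) are all assumptions made for illustration. It only shows that the frozen output is reproduced exactly before training, and that a per-frame emotion track can change individual frames once the projection is non-zero.

```python
import copy
import random


def linear(weights, frame):
    """Apply a weight matrix (list of rows) to one feature frame."""
    return [sum(w * x for w, x in zip(row, frame)) for row in weights]


class FrozenBlock:
    """Hypothetical stand-in for one block of a pretrained TTS backbone."""

    def __init__(self, dim, seed=0):
        rng = random.Random(seed)
        # These weights are never updated: the original model stays frozen.
        self.weights = [[rng.uniform(-1, 1) for _ in range(dim)]
                        for _ in range(dim)]

    def __call__(self, frames):
        return [linear(self.weights, f) for f in frames]


class ControlBranch:
    """Trainable copy of the frozen block plus a zero-initialized projection."""

    def __init__(self, frozen, dim):
        # Start from the pretrained weights (assumed trainable from here on).
        self.copy_weights = copy.deepcopy(frozen.weights)
        # All zeros at initialization, so the branch contributes nothing yet.
        self.zero_proj = [[0.0] * dim for _ in range(dim)]

    def __call__(self, frames, emotion):
        residuals = []
        for frame, emo in zip(frames, emotion):
            # Assumption: the per-frame emotion vector is added to the input.
            conditioned = [x + e for x, e in zip(frame, emo)]
            hidden = linear(self.copy_weights, conditioned)
            residuals.append(linear(self.zero_proj, hidden))
        return residuals


def controlled_forward(frozen, branch, frames, emotion):
    """Frozen output plus the control branch residual, frame by frame."""
    base = frozen(frames)
    residual = branch(frames, emotion)
    return [[b + r for b, r in zip(bf, rf)]
            for bf, rf in zip(base, residual)]
```

With the projection still at zero, `controlled_forward` returns exactly what `FrozenBlock` returns, which is one way to read the claim that the original model's abilities are inherited. If the projection is later set to non-zero values (for example by training), two emotion tracks that differ only at one frame produce outputs that differ only at that frame.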

Taxonomy

Topics: Topic Modeling · Speech Recognition and Synthesis · Sentiment Analysis and Opinion Mining