ControlSpeech: Towards Simultaneous and Independent Zero-shot Speaker Cloning and Zero-shot Language Style Control

Shengpeng Ji; Qian Chen; Wen Wang; Jialong Zuo; Minghui Fang; Ziyue Jiang; Hai Huang; Zehan Wang; Xize Cheng; Siqi Zheng; Zhou Zhao

arXiv:2406.01205 · eess.AS · June 5, 2025

TL;DR

ControlSpeech is a novel TTS system that achieves simultaneous zero-shot speaker cloning and style control, allowing flexible and high-quality speech synthesis with independent control over timbre, content, and style.

Contribution

The paper introduces ControlSpeech, a system that combines zero-shot speaker cloning with independent style control by generating speech in a discrete decoupling codec space and modeling textual style with a Style Mixture Semantic Density module.

Findings

01 Achieves state-of-the-art controllability and quality in TTS.

02 Demonstrates effective zero-shot speaker cloning and style manipulation.

03 Provides a new dataset for style-controlled speech synthesis.

Abstract

In this paper, we present ControlSpeech, a text-to-speech (TTS) system capable of fully cloning the speaker's voice and enabling arbitrary control and adjustment of speaking style. Prior zero-shot TTS models only mimic the speaker's voice without further control and adjustment capabilities while prior controllable TTS models cannot perform speaker-specific voice generation. Therefore, ControlSpeech focuses on a more challenging task: a TTS system with controllable timbre, content, and style at the same time. ControlSpeech takes speech prompts, content prompts, and style prompts as inputs and utilizes bidirectional attention and mask-based parallel decoding to capture codec representations corresponding to timbre, content, and style in a discrete decoupling codec space. Moreover, we analyze the many-to-many issue in textual style control and propose the Style Mixture Semantic Density…
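The mask-based parallel decoding mentioned in the abstract can be illustrated with a generic, MaskGIT-style sketch. This is not ControlSpeech's implementation: the `toy_predict` function is a random stand-in for a bidirectional model, and every name here (`MASK`, `toy_predict`, `parallel_decode`, the cosine schedule, the vocabulary size) is a hypothetical choice made for illustration only.

```python
import math
import random

MASK = -1  # hypothetical sentinel marking a not-yet-decoded codec token


def toy_predict(tokens, vocab_size, rng):
    """Stand-in for a bidirectional model (assumption, not the paper's network).

    Returns one (token, confidence) guess per position, all positions at once.
    """
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def parallel_decode(length, vocab_size=8, steps=4, seed=0):
    """Iteratively unmask a token sequence, committing the most confident guesses."""
    rng = random.Random(seed)
    tokens = [MASK] * length
    for step in range(1, steps + 1):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        # Predict every position in parallel (the real model would attend
        # bidirectionally over already-committed tokens and the prompts).
        preds = toy_predict(tokens, vocab_size, rng)
        # Cosine schedule: how many positions should remain masked afterwards.
        keep_masked = int(length * math.cos(math.pi / 2 * step / steps))
        n_commit = max(1, len(masked) - keep_masked)
        # Commit only the highest-confidence predictions among masked positions.
        best = sorted(masked, key=lambda i: preds[i][1], reverse=True)[:n_commit]
        for i in best:
            tokens[i] = preds[i][0]
    return tokens


if __name__ == "__main__":
    print(parallel_decode(10))
```

Each pass fills in several positions at once instead of one token per step, which is the property that distinguishes this family of decoders from autoregressive generation. On the final step the schedule reaches zero, so every remaining masked position is committed.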

Code & Models

Repositories

jishengpeng/controlspeech (Official)

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling