DisCo-Speech: Controllable Zero-Shot Speech Generation with A Disentangled Speech Codec

Tao Li; Wenshuo Ge; Zhichao Wang; Zihao Cui; Yong Ma; Yingying Gao; Chao Deng; Shilei Zhang; Junlan Feng

arXiv:2512.13251 · cs.SD · January 6, 2026

TL;DR

DisCo-Speech introduces a speech codec that disentangles content, prosody, and timbre, enabling zero-shot controllable speech synthesis in which prosody and timbre can be manipulated independently.

Contribution

The paper proposes a two-stage disentangled speech codec (DisCodec) and an LM-based generator for zero-shot controllable TTS, addressing the entanglement of timbre and prosody in standard codecs.

Findings

01. Achieves competitive voice cloning performance.
02. Enables superior zero-shot prosody control.
03. Provides a robust foundation for controllable speech synthesis.

Abstract

Codec-based language models (LMs) have revolutionized text-to-speech (TTS). However, standard codecs entangle timbre and prosody, which hinders independent control in continuation-based LMs. To tackle this challenge, we propose DisCo-Speech, a zero-shot controllable TTS framework featuring a disentangled speech codec (DisCodec) and an LM-based generator. The core component DisCodec employs a two-stage design: 1) tri-factor disentanglement to separate speech into content, prosody, and timbre subspaces via parallel encoders and hybrid losses; and 2) fusion and reconstruction that merges content and prosody into unified content-prosody tokens suitable for LM prediction, while jointly optimizing reconstruction to address the disentanglement-reconstruction trade-off. This allows the LM to perform prosodic continuation from a style prompt while the decoder injects target timbre, enabling…
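The data flow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: every name here (`Utterance`, `encode`, `fuse`, `toy_lm_continue`, `decode`, `synthesize`) is hypothetical, the "encoders" simply read off pre-labelled factors instead of learning them, and the "LM" is a placeholder that cycles the prosody pattern of the style prompt. It only shows which input supplies which factor: the text supplies content, the style prompt supplies prosody, and the timbre prompt supplies the voice injected at decoding time.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Utterance:
    """Toy stand-in for a waveform, pre-labelled with its three factors."""
    content: List[str]   # hypothetical phone-like units
    prosody: List[int]   # hypothetical quantized prosody level per unit
    timbre: str          # hypothetical speaker label


Token = Tuple[str, int]  # one fused content-prosody token


def encode(utt: Utterance) -> Tuple[List[str], List[int], str]:
    """Stage 1 (toy): split an utterance into content, prosody, and timbre."""
    return list(utt.content), list(utt.prosody), utt.timbre


def fuse(content: List[str], prosody: List[int]) -> List[Token]:
    """Stage 2 (toy): merge content and prosody into content-prosody tokens."""
    if len(content) != len(prosody):
        raise ValueError("content and prosody must have the same length")
    return list(zip(content, prosody))


def toy_lm_continue(style_tokens: List[Token], text_units: List[str]) -> List[Token]:
    """Placeholder for the LM: reuse the style prompt's prosody pattern
    (by cycling through it) for the new text. A real LM would predict tokens."""
    if not style_tokens:
        raise ValueError("style prompt must contain at least one token")
    pattern = [p for _, p in style_tokens]
    return [(unit, pattern[i % len(pattern)]) for i, unit in enumerate(text_units)]


def decode(tokens: List[Token], timbre: str) -> Utterance:
    """Toy decoder: rebuild an utterance and inject the target timbre."""
    return Utterance(
        content=[c for c, _ in tokens],
        prosody=[p for _, p in tokens],
        timbre=timbre,
    )


def synthesize(text_units: List[str], style_prompt: Utterance,
               timbre_prompt: Utterance) -> Utterance:
    """Prosody comes from style_prompt, timbre from timbre_prompt."""
    content, prosody, _ = encode(style_prompt)   # style prompt's timbre is ignored
    _, _, timbre = encode(timbre_prompt)         # only the timbre is kept here
    tokens = toy_lm_continue(fuse(content, prosody), text_units)
    return decode(tokens, timbre)


style = Utterance(content=["a", "b"], prosody=[3, 1], timbre="speaker_A")
voice = Utterance(content=["x"], prosody=[2], timbre="speaker_B")
result = synthesize(["h", "e", "l"], style, voice)
print(result)
```

In this toy run the output carries the new text, the prosody pattern of `style`, and the timbre of `voice`, which mirrors the independent control the abstract describes. The real system operates on learned token sequences and audio, not on labelled lists.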

Taxonomy

Topics: Speech Recognition and Synthesis · Voice and Speech Disorders · Phonetics and Phonology Research