DMP-TTS: Disentangled multi-modal Prompting for Controllable Text-to-Speech with Chained Guidance

Kang Yin; Chunyu Qiang; Sirui Zhao; Xiaopeng Wang; Yuzhe Liang; Pengfei Cai; Tong Xu; Chen Zhang; Enhong Chen

arXiv:2512.09504 · cs.SD · December 11, 2025

Open Access

TL;DR

DMP-TTS introduces a disentangled, multi-modal prompting framework for controllable TTS, enabling independent manipulation of speaker and style attributes with improved controllability and training stability.

Contribution

The paper proposes a novel DiT-based TTS model with explicit disentanglement, multi-modal prompting, and hierarchical guidance, advancing controllability and training efficiency.

Findings

01 Outperforms open-source baselines in style controllability

02 Maintains competitive speech intelligibility and naturalness

03 Uses novel hierarchical guidance and feature distillation techniques

Abstract

Controllable text-to-speech (TTS) systems face significant challenges in achieving independent manipulation of speaker timbre and speaking style, often suffering from entanglement between these attributes. We present DMP-TTS, a latent Diffusion Transformer (DiT) framework with explicit disentanglement and multi-modal prompting. A CLAP-based style encoder (Style-CLAP) aligns cues from reference audio and descriptive text in a shared space and is trained with contrastive learning plus multi-task supervision on style attributes. For fine-grained control during inference, we introduce chained classifier-free guidance (cCFG) trained with hierarchical condition dropout, enabling independent adjustment of content, timbre, and style guidance strengths. Additionally, we employ Representation Alignment (REPA) to distill acoustic-semantic features from a pretrained Whisper model into intermediate…
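
The chained guidance idea can be illustrated with a small sketch. The abstract does not give the exact cCFG formula or the dropout schedule, so everything below is an assumption made for illustration: the combination rule (one guidance term per condition, each measured against the next-weaker conditioning level), the function names `chained_cfg` and `sample_condition_mask`, and the drop probabilities. It is not the authors' implementation.

```python
import random


def chained_cfg(pred_uncond, pred_content, pred_content_timbre, pred_full,
                w_content=1.0, w_timbre=1.0, w_style=1.0):
    """Combine four model predictions with one guidance weight per condition.

    Assumed chain: unconditional -> +content -> +timbre -> +style.
    Each weight scales the change contributed by adding one condition.
    Works on floats or on array types that support + - * (e.g. NumPy arrays).
    """
    return (pred_uncond
            + w_content * (pred_content - pred_uncond)
            + w_timbre * (pred_content_timbre - pred_content)
            + w_style * (pred_full - pred_content_timbre))


def sample_condition_mask(rng, p_drop_all=0.1, p_drop_timbre_style=0.1,
                          p_drop_style=0.1):
    """Sample which conditions to keep for one training example.

    Assumed hierarchical dropout: conditions are dropped in nested order,
    so style is never kept without timbre, and timbre never without content.
    This yields exactly the four conditioning levels used by chained_cfg.
    """
    u = rng.random()
    if u < p_drop_all:
        return {"content": False, "timbre": False, "style": False}
    if u < p_drop_all + p_drop_timbre_style:
        return {"content": True, "timbre": False, "style": False}
    if u < p_drop_all + p_drop_timbre_style + p_drop_style:
        return {"content": True, "timbre": True, "style": False}
    return {"content": True, "timbre": True, "style": True}


# Toy scalar predictions standing in for model outputs at each level.
guided = chained_cfg(0.0, 1.0, 3.0, 6.0,
                     w_content=2.0, w_timbre=1.0, w_style=0.5)
mask = sample_condition_mask(random.Random(0))
```

With all three weights set to 1 the expression collapses to the fully conditioned prediction, and with all weights set to 0 it collapses to the unconditional one. Changing a single weight only rescales the contribution of the corresponding condition, which is what makes independent adjustment of content, timbre, and style strengths possible under this assumed formulation.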

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Phonetics and Phonology Research