DiffCSS: Diverse and Expressive Conversational Speech Synthesis with Diffusion Models

Weihao Wu; Zhiwei Lin; Yixuan Zhou; Jingbei Li; Rui Niu; Qinghua Wu; Songjun Cao; Long Ma; Zhiyong Wu

arXiv:2502.19924·cs.SD·February 28, 2025

Open Access

TL;DR

DiffCSS introduces a diffusion model-based framework for conversational speech synthesis that generates diverse, expressive, and contextually coherent speech, surpassing existing deterministic systems in naturalness and variety.

Contribution

The paper combines a diffusion-based prosody predictor with an LM-based TTS backbone, a pairing it describes as a first for CSS, to enable diverse and expressive speech synthesis.

Findings

1. Synthesized speech is more diverse and expressive.
2. Synthesized speech is more contextually coherent.
3. The system outperforms existing CSS methods.

Abstract

Conversational speech synthesis (CSS) aims to synthesize both contextually appropriate and expressive speech, and considerable efforts have been made to enhance the understanding of conversational context. However, existing CSS systems are limited to deterministic prediction, overlooking the diversity of potential responses. Moreover, they rarely employ language model (LM)-based TTS backbones, limiting the naturalness and quality of synthesized speech. To address these issues, in this paper, we propose DiffCSS, an innovative CSS framework that leverages diffusion models and an LM-based TTS backbone to generate diverse, expressive, and contextually coherent speech. A diffusion-based context-aware prosody predictor is proposed to sample diverse prosody embeddings conditioned on multimodal conversational context. Then a prosody-controllable LM-based TTS backbone is developed to synthesize…
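To illustrate why a diffusion-based predictor yields diverse outputs where a deterministic one cannot, the following is a minimal sketch of conditional DDPM-style ancestral sampling. It is not the paper's architecture: `toy_noise_predictor` is a hypothetical, untrained stand-in for a learned denoising network, the context is assumed to be a single fixed-size vector rather than a multimodal encoding, and the schedule values and function names are illustrative assumptions.

```python
import numpy as np


def make_schedule(num_steps=50, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule (an assumed choice, not taken from the paper)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars


def toy_noise_predictor(x_t, t_frac, context, weights):
    """Hypothetical stand-in for a trained denoiser.

    Predicts the noise in x_t from the noisy embedding, the context vector,
    and the normalized timestep. The weights are random, so the samples are
    not meaningful prosody; only the sampling mechanics are shown.
    """
    features = np.concatenate([x_t, context, [t_frac]])
    return np.tanh(features @ weights)


def sample_prosody_embedding(context, weights, rng, num_steps=50):
    """Draw one embedding by reverse diffusion, conditioned on `context`."""
    betas, alphas, alpha_bars = make_schedule(num_steps)
    dim = weights.shape[1]
    x = rng.standard_normal(dim)  # start from pure Gaussian noise
    for t in range(num_steps - 1, -1, -1):
        eps_hat = toy_noise_predictor(x, t / num_steps, context, weights)
        # Posterior mean of the reverse step given the predicted noise.
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        # Fresh noise at every step except the last is the source of diversity.
        noise = rng.standard_normal(dim) if t > 0 else np.zeros(dim)
        x = mean + np.sqrt(betas[t]) * noise
    return x


if __name__ == "__main__":
    setup_rng = np.random.default_rng(0)
    context = setup_rng.standard_normal(4)               # assumed context vector
    weights = setup_rng.standard_normal((8 + 4 + 1, 8)) * 0.1
    first = sample_prosody_embedding(context, weights, np.random.default_rng(1))
    second = sample_prosody_embedding(context, weights, np.random.default_rng(2))
    # Same context, different random draws: the two embeddings differ.
    print(np.allclose(first, second))
```

With the same context and weights, changing only the random generator changes the sampled embedding, whereas a deterministic predictor would return one fixed output per context.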

Taxonomy

Topics: Speech Recognition and Synthesis · Voice and Speech Disorders · Speech and dialogue systems

Methods: Diffusion