Multi-GradSpeech: Towards Diffusion-based Multi-Speaker Text-to-speech Using Consistent Diffusion Models

Heyang Xue; Shuai Guo; Pengcheng Zhu; Mengxiao Bi

arXiv:2308.10428·eess.AS·September 1, 2023

TL;DR

Multi-GradSpeech applies a Consistent Diffusion Model to multi-speaker text-to-speech, reducing sampling drift and outperforming both Grad-TTS and fine-tuning approaches in multi-speaker scenarios.

Contribution

The paper introduces the Consistent Diffusion Model (CDM) as the generative modeling approach for multi-speaker TTS, enforcing its consistency property during training to alleviate sampling drift at inference and improve performance over prior diffusion-based models.

Findings

01. Significant performance improvements over Grad-TTS in multi-speaker TTS.

02. Outperforms fine-tuning approaches in multi-speaker scenarios.

03. Demonstrates the effectiveness of enforcing consistency during training.

Abstract

Despite imperfect score-matching causing drift in training and sampling distributions of diffusion models, recent advances in diffusion-based acoustic models have revolutionized data-sufficient single-speaker Text-to-Speech (TTS) approaches, with Grad-TTS being a prime example. However, the sampling drift problem causes these approaches to struggle in multi-speaker scenarios in practice, because the target data distribution is more complex than in single-speaker scenarios. In this paper, we present Multi-GradSpeech, a multi-speaker diffusion-based acoustic model that introduces the Consistent Diffusion Model (CDM) as a generative modeling approach. We enforce the consistency property of CDM during the training process to alleviate the sampling drift problem in the inference stage, resulting in significant improvements in multi-speaker TTS performance. Our experimental results corroborate…
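
The abstract does not spell out the consistency property, so the following is only an illustrative sketch, not the paper's implementation. It assumes the form commonly used for consistent diffusion models: the denoiser's prediction at time t should equal the average of its predictions at an earlier time t' taken over reverse-process samples started from x_t. Everything here is a hypothetical toy: 1-D data x_0 ~ N(0, 1), forward noising x_t = x_0 + t * noise, and the exact Gaussian posterior standing in for a learned reverse sampler. The names exact_denoiser, biased_denoiser, and consistency_gap are invented for illustration.

```python
import numpy as np

def exact_denoiser(x, t):
    """E[x_0 | x_t = x] for the toy setup x_0 ~ N(0, 1), x_t = x_0 + t * noise."""
    return x / (1.0 + t**2)

def biased_denoiser(x, t):
    """Hypothetical imperfect denoiser: exact answer plus a time-dependent bias."""
    return exact_denoiser(x, t) + 0.1 * t

def sample_reverse(x_t, t, t_prev, n, rng):
    """Draw n samples of x_{t_prev} given x_t from the exact toy posterior.

    Assumption: a real system would run the learned reverse process here.
    """
    mean = x_t * (1.0 + t_prev**2) / (1.0 + t**2)
    var = (1.0 + t_prev**2) * (t**2 - t_prev**2) / (1.0 + t**2)
    return mean + np.sqrt(var) * rng.standard_normal(n)

def consistency_gap(denoiser, x_t, t, t_prev, n=200_000, seed=0):
    """Monte Carlo estimate of h(x_t, t) - E[h(x_{t_prev}, t_prev) | x_t]."""
    rng = np.random.default_rng(seed)
    x_prev = sample_reverse(x_t, t, t_prev, n, rng)
    return float(denoiser(x_t, t) - denoiser(x_prev, t_prev).mean())

# The exact denoiser is consistent (gap near zero); the biased one is not.
gap_exact = consistency_gap(exact_denoiser, x_t=1.5, t=1.0, t_prev=0.5)
gap_biased = consistency_gap(biased_denoiser, x_t=1.5, t=1.0, t_prev=0.5)
print(f"exact gap:  {gap_exact:+.4f}")
print(f"biased gap: {gap_biased:+.4f}")
```

In a training loop, the squared gap could serve as a regularizer added to the usual denoising loss, which is one plausible reading of "enforcing consistency during training"; the paper's actual loss and weighting are not given in this summary.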

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing

Methods: Diffusion