High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models

Chunyu Qiang; Hao Li; Yixin Tian; Yi Zhao; Ying Zhang; Longbiao Wang; Jianwu Dang

arXiv:2309.15512 · cs.SD · December 19, 2023

Open Access

TL;DR

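This paper introduces a minimally-supervised speech synthesis system built entirely on diffusion models. It improves controllability, prosodic diversity, and audio fidelity by addressing three limitations of existing methods: redundant, high-dimensional semantic representations; high-frequency waveform distortion in discrete acoustic representations; and the instability or prosodic averaging of autoregressive and non-autoregressive frameworks.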

Contribution

It presents a novel diffusion model framework for high-fidelity speech synthesis with minimal supervision, utilizing contrastive token-acoustic pretraining and continuous regression tasks.

Findings

01 Outperforms baseline methods in quality

02 Enhances prosodic diversity and controllability

03 Reduces waveform distortion

Abstract

Text-to-speech (TTS) methods have shown promising results in voice cloning, but they require a large number of labeled text-speech pairs. Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations (semantic and acoustic) and using two sequence-to-sequence tasks to enable training with minimal supervision. However, existing methods suffer from information redundancy and dimension explosion in the semantic representation, and from high-frequency waveform distortion in the discrete acoustic representation. Autoregressive frameworks exhibit typical instability and uncontrollability issues, while non-autoregressive frameworks suffer from prosodic averaging caused by duration prediction models. To address these issues, we propose a minimally-supervised high-fidelity speech synthesis method in which all modules are built on diffusion models. The…

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing

Methods: Diffusion