Minimally-Supervised Speech Synthesis with Conditional Diffusion Model and Language Model: A Comparative Study of Semantic Coding

Chunyu Qiang; Hao Li; Hao Ni; He Qu; Ruibo Fu; Tao Wang; Longbiao Wang; Jianwu Dang

arXiv:2307.15484 · cs.SD · December 19, 2023

TL;DR

This paper introduces three progressive diffusion-based models for minimally-supervised speech synthesis, addressing key issues in semantic coding and prosody modeling to improve audio quality and diversity.

Contribution

It proposes Diff-LM-Speech, Tetra-Diff-Speech, and Tri-Diff-Speech, novel diffusion-based architectures that enhance semantic encoding and prosody control in TTS with minimal supervision.

Findings

01. Proposed models outperform baseline methods in quality.

02. Diffusion models improve semantic embedding accuracy.

03. Non-autoregressive structures enable diverse prosodic expressions.

Abstract

Recently, there has been a growing interest in text-to-speech (TTS) methods that can be trained with minimal supervision by combining two types of discrete speech representations and using two sequence-to-sequence tasks to decouple TTS. However, existing methods suffer from three problems: the high dimensionality and waveform distortion of discrete speech representations, the prosodic averaging problem caused by the duration prediction model in non-autoregressive frameworks, and the information redundancy and dimension explosion problems of existing semantic encoding methods. To address these problems, three progressive methods are proposed. First, we propose Diff-LM-Speech, an autoregressive structure consisting of a language model and diffusion models, which models the semantic embedding into the mel-spectrogram based on a diffusion model to achieve higher audio quality. We also…

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Diffusion