DCTTS: Discrete Diffusion Model with Contrastive Learning for Text-to-speech Generation

Zhichao Wu; Qiulin Li; Sixing Liu; Qun Yang

arXiv:2309.06787·cs.SD·September 14, 2023

Open Access

TL;DR

This paper introduces DCTTS, a discrete diffusion model with contrastive learning for text-to-speech synthesis, achieving high quality and fast sampling with reduced resource consumption.

Contribution

The paper presents a novel discrete diffusion model with contrastive learning and an efficient text encoder for improved TTS performance and efficiency.

Findings

01 High-quality speech synthesis demonstrated

02 Significant reduction in resource consumption

03 Faster sampling speed compared to traditional diffusion models

Abstract

In the text-to-speech (TTS) task, the latent diffusion model has excellent fidelity and generalization, but its expensive resource consumption and slow inference speed have always been a challenge. This paper proposes the Discrete Diffusion Model with Contrastive Learning for Text-to-Speech Generation (DCTTS). DCTTS makes the following contributions: 1) a TTS diffusion model based on discrete space significantly lowers the computational consumption of the diffusion model and improves sampling speed; 2) a contrastive learning method based on discrete space enhances the alignment between speech and text and improves sampling quality; and 3) an efficient text encoder simplifies the model's parameters and increases computational efficiency. The experimental results demonstrate that the approach proposed in this paper has outstanding speech synthesis quality…
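
To make the second contribution concrete, here is a minimal sketch of a generic symmetric contrastive (InfoNCE-style) objective over paired speech and text embeddings. It is an illustration only, not the loss defined in the paper: the function name `info_nce_loss`, the `temperature` value, and the toy embeddings are all assumptions, and the abstract does not specify how DCTTS builds its embeddings from discrete tokens.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length (assumes it is not all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _cross_entropy_diag(logits):
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def info_nce_loss(speech_emb, text_emb, temperature=0.1):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    speech_emb[i] and text_emb[i] are treated as a matching pair; every
    other combination in the batch acts as a negative. Hypothetical helper,
    not taken from the DCTTS paper.
    """
    s = [_normalize(v) for v in speech_emb]
    t = [_normalize(v) for v in text_emb]
    # Cosine similarity matrix, scaled by the temperature.
    logits = [
        [sum(a * b for a, b in zip(si, tj)) / temperature for tj in t]
        for si in s
    ]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-speech direction
    return 0.5 * (_cross_entropy_diag(logits) + _cross_entropy_diag(logits_t))


# Toy batch: matched pairs should score a lower loss than shuffled pairs.
speech = [[1.0, 0.0], [0.0, 1.0]]
matched_text = [[1.0, 0.0], [0.0, 1.0]]
shuffled_text = [[0.0, 1.0], [1.0, 0.0]]

loss_matched = info_nce_loss(speech, matched_text)
loss_shuffled = info_nce_loss(speech, shuffled_text)
```

Minimizing a loss of this kind pulls each speech embedding toward its own text and away from the other texts in the batch, which is the general sense in which contrastive learning can strengthen speech-text alignment.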

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing