DCTTS: Discrete Diffusion Model with Contrastive Learning for Text-to-speech Generation
Zhichao Wu, Qiulin Li, Sixing Liu, Qun Yang

TL;DR
This paper introduces DCTTS, a discrete diffusion model with contrastive learning for text-to-speech synthesis, achieving high quality and fast sampling with reduced resource consumption.
Contribution
The paper presents a novel discrete diffusion model with contrastive learning and an efficient text encoder for improved TTS performance and efficiency.
Findings
High-quality speech synthesis demonstrated
Significant reduction in resource consumption
Faster sampling speed compared to traditional diffusion models
Abstract
In the text-to-speech (TTS) task, the latent diffusion model offers excellent fidelity and generalization, but its expensive resource consumption and slow inference speed have long been a challenge. This paper proposes the Discrete Diffusion Model with Contrastive Learning for Text-to-Speech Generation (DCTTS). DCTTS makes the following contributions: 1) a TTS diffusion model based on discrete space, which significantly lowers the computational cost of the diffusion model and improves sampling speed; 2) a contrastive learning method based on discrete space, which strengthens the alignment between speech and text and improves sampling quality; and 3) an efficient text encoder, which reduces the model's parameter count and increases computational efficiency. The experimental results demonstrate that the approach proposed in this paper has outstanding speech synthesis quality…
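The abstract does not specify the contrastive objective DCTTS uses. As a rough illustration of how a contrastive loss can pull matching text and speech representations together, the sketch below uses a symmetric InfoNCE-style loss, which is an assumed stand-in rather than the authors' formulation. The function names, the temperature value, and the random embeddings are all hypothetical.

```python
import math
import random


def normalize(vec):
    """Scale a vector to unit L2 norm so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce(anchors, candidates, temperature=0.1):
    """Mean cross-entropy of identifying candidates[i] as the match for anchors[i]."""
    a = [normalize(v) for v in anchors]
    c = [normalize(v) for v in candidates]
    total = 0.0
    for i, anchor in enumerate(a):
        # Similarity of this anchor to every candidate in the batch.
        logits = [sum(x * y for x, y in zip(anchor, cand)) / temperature for cand in c]
        # Numerically stable log-sum-exp.
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_z - logits[i]
    return total / len(a)


def symmetric_contrastive_loss(text_emb, speech_emb, temperature=0.1):
    """Average of the text-to-speech and speech-to-text InfoNCE terms."""
    return 0.5 * (
        info_nce(text_emb, speech_emb, temperature)
        + info_nce(speech_emb, text_emb, temperature)
    )


# Placeholder data: random "text" embeddings and near-identical "speech" embeddings.
random.seed(0)
text = [[random.gauss(0, 1) for _ in range(16)] for _ in range(8)]
speech = [[x + 0.01 * random.gauss(0, 1) for x in row] for row in text]
mismatched = speech[1:] + speech[:1]  # rotate so every pair is wrong

aligned_loss = symmetric_contrastive_loss(text, speech)
mismatched_loss = symmetric_contrastive_loss(text, mismatched)
print(aligned_loss < mismatched_loss)
```

The loss is low when each text embedding is most similar to its own speech embedding and high when the pairs are misaligned, which is the general sense in which a contrastive term can reinforce speech-text alignment.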
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing
