LatentSpeech: Latent Diffusion for Text-To-Speech Generation

Haowei Lou; Helen Paik; Pari Delir Haghighi; Wen Hu; Lina Yao

arXiv:2412.08117·cs.SD·December 12, 2024

Open Access

TL;DR

LatentSpeech introduces a novel latent diffusion-based TTS system that significantly reduces computational complexity and improves speech naturalness and accuracy, outperforming existing models on benchmark datasets.

Contribution

This paper is the first to apply latent diffusion models to TTS, reducing target dimension and enhancing speech quality and efficiency.

Findings

01. 25% improvement in Word Error Rate

02. 24% reduction in Mel Cepstral Distortion

03. Further improvements with additional training data
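For readers unfamiliar with the first metric, the following is a minimal sketch of how Word Error Rate is commonly computed (word-level edit distance divided by reference length) and how a relative improvement could be derived from two scores. It is not the authors' evaluation code. The function names and the example scores are hypothetical, and reading the 25% figure as a relative reduction against a baseline is an assumption, since this summary does not say how the percentage is defined.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Fractional reduction of an error metric relative to a baseline (assumed definition)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - improved) / baseline


if __name__ == "__main__":
    # One deleted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
    # Hypothetical scores, chosen only to illustrate a 25% relative reduction.
    print(relative_reduction(0.20, 0.15))
```

The same `relative_reduction` helper would apply to Mel Cepstral Distortion, since it is also a lower-is-better metric.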

Abstract

Diffusion-based generative AI has gained significant attention for its superior performance over other generative techniques such as Generative Adversarial Networks and Variational Autoencoders. While it has achieved notable advances in fields such as computer vision and natural language processing, its application to speech generation remains under-explored. Mainstream Text-to-Speech (TTS) systems primarily map outputs to Mel-Spectrograms (MelSpecs) in the spectral space, leading to high computational loads due to the sparsity of MelSpecs. To address these limitations, we propose LatentSpeech, a novel TTS generation approach utilizing latent diffusion models. By using latent embeddings as the intermediate representation, LatentSpeech reduces the target dimension to 5% of what is required for MelSpecs, simplifying the processing for the TTS encoder and vocoder and enabling efficient high-quality speech…
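To make the 5% figure concrete, here is a back-of-the-envelope sketch of the size of the prediction target. Only the 5% ratio comes from the abstract. The 80 mel bins and 1000 frames are hypothetical values picked for illustration, and the helper names are not from the paper.

```python
def melspec_elements(mel_bins: int, frames: int) -> int:
    """Number of values in a MelSpec target of shape (mel_bins, frames)."""
    return mel_bins * frames


def latent_budget(mel_bins: int, frames: int, percent: int = 5) -> int:
    """Number of values in a latent target sized at `percent` of the MelSpec.

    The default of 5 reflects the ratio stated in the abstract; how the
    latent is actually shaped (channels vs. time steps) is not specified here.
    """
    return melspec_elements(mel_bins, frames) * percent // 100


if __name__ == "__main__":
    # Hypothetical utterance: 80 mel bins, 1000 frames.
    print(melspec_elements(80, 1000))  # values the model would predict in MelSpec space
    print(latent_budget(80, 1000))     # values it would predict at 5% of that size
```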

Taxonomy

Topics: Natural Language Processing Techniques · Speech and dialogue systems · Topic Modeling

Methods: Softmax · Attention Is All You Need · Diffusion