VoiceLDM: Text-to-Speech with Environmental Context

Yeonghyeon Lee; Inmo Yeon; Juhan Nam; Joon Son Chung

arXiv:2309.13664 · eess.AS · September 26, 2023

Open Access

TL;DR

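VoiceLDM is a latent-diffusion text-to-audio model conditioned on two prompts: a description prompt that sets the environmental context and a content prompt that sets the linguistic content. It builds on pretrained CLAP and Whisper models and uses dual classifier-free guidance, and its generated speech is reported to surpass the intelligibility of the ground-truth audio on AudioCaps.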

Contribution

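The paper introduces VoiceLDM, which extends a latent-diffusion text-to-audio model with an additional content prompt so that environmental context and linguistic content can be controlled separately. Pretrained CLAP and Whisper models allow it to be trained on large amounts of real-world audio without manual annotations or transcriptions.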

Findings

01. Generates plausible audio that aligns with the input prompts.

02. Surpasses the speech intelligibility of the ground-truth audio on AudioCaps.

03. Achieves competitive zero-shot text-to-speech (TTS) and text-to-audio (TTA) results.

Abstract

This paper presents VoiceLDM, a model designed to produce audio that accurately follows two distinct natural language text prompts: the description prompt and the content prompt. The former provides information about the overall environmental context of the audio, while the latter conveys the linguistic content. To achieve this, we adopt a text-to-audio (TTA) model based on latent diffusion models and extend its functionality to incorporate an additional content prompt as a conditional input. By utilizing pretrained contrastive language-audio pretraining (CLAP) and Whisper, VoiceLDM is trained on large amounts of real-world audio without manual annotations or transcriptions. Additionally, we employ dual classifier-free guidance to further enhance the controllability of VoiceLDM. Experimental results demonstrate that VoiceLDM is capable of generating plausible audio that aligns well with…
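The dual classifier-free guidance mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the formula, so the combination rule below is an assumption: it applies one guidance weight per prompt, each scaling the difference between a single-condition prediction and the fully unconditional one. The names dual_cfg, toy_eps, w_desc, and w_cont are hypothetical, and toy_eps is a stand-in for a trained noise-prediction network, not VoiceLDM itself.

```python
import numpy as np

def dual_cfg(eps_fn, z, c_desc, c_cont, w_desc, w_cont):
    """Combine noise predictions under two independent guidance weights.

    eps_fn(z, c_desc, c_cont) is any noise predictor that accepts None
    for a dropped condition (assumed interface, not the paper's API).
    """
    eps_full = eps_fn(z, c_desc, c_cont)  # both prompts given
    eps_desc = eps_fn(z, c_desc, None)    # description prompt only
    eps_cont = eps_fn(z, None, c_cont)    # content prompt only
    eps_null = eps_fn(z, None, None)      # fully unconditional
    # Each weight pushes the prediction further along the direction
    # contributed by its own prompt.
    return (eps_full
            + w_desc * (eps_desc - eps_null)
            + w_cont * (eps_cont - eps_null))

def toy_eps(z, c_desc, c_cont):
    """Toy predictor: each condition adds its embedding to a scaled latent."""
    out = 0.1 * z
    if c_desc is not None:
        out = out + c_desc
    if c_cont is not None:
        out = out + c_cont
    return out

# Toy inputs: a zero latent and constant "embeddings" for the two prompts.
z = np.zeros(4)
c_desc = np.ones(4)        # stands in for a description embedding
c_cont = 2.0 * np.ones(4)  # stands in for a content embedding

plain = dual_cfg(toy_eps, z, c_desc, c_cont, w_desc=0.0, w_cont=0.0)
guided = dual_cfg(toy_eps, z, c_desc, c_cont, w_desc=2.0, w_cont=1.0)
```

With both weights at zero the result equals the plain conditional prediction; raising w_desc or w_cont strengthens the influence of the corresponding prompt independently of the other, which is the kind of separate control the abstract attributes to dual guidance.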

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis

Methods: Diffusion