VoiceLDM: Text-to-Speech with Environmental Context
Yeonghyeon Lee, Inmo Yeon, Juhan Nam, Joon Son Chung

TL;DR
VoiceLDM is a text-to-audio model that generates speech from two prompts: one describing the environment and one giving the spoken content. It combines latent diffusion, pretrained CLAP and Whisper models, and dual classifier-free guidance, and its speech intelligibility on AudioCaps is reported to exceed that of the ground-truth audio.
Contribution
The paper introduces VoiceLDM, which extends a latent-diffusion text-to-audio model with a content prompt alongside the environmental description prompt, giving more controllable synthesis while relying on pretrained models rather than manual annotation.
Findings
Generates plausible audio aligned with input prompts.
Surpasses the speech intelligibility of the ground-truth audio on AudioCaps.
Achieves competitive zero-shot text-to-speech (TTS) and text-to-audio (TTA) results.
Abstract
This paper presents VoiceLDM, a model designed to produce audio that accurately follows two distinct natural language text prompts: the description prompt and the content prompt. The former provides information about the overall environmental context of the audio, while the latter conveys the linguistic content. To achieve this, we adopt a text-to-audio (TTA) model based on latent diffusion models and extend its functionality to incorporate an additional content prompt as a conditional input. By utilizing pretrained contrastive language-audio pretraining (CLAP) and Whisper, VoiceLDM is trained on large amounts of real-world audio without manual annotations or transcriptions. Additionally, we employ dual classifier-free guidance to further enhance the controllability of VoiceLDM. Experimental results demonstrate that VoiceLDM is capable of generating plausible audio that aligns well with…
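The dual classifier-free guidance mentioned in the abstract can be illustrated with a minimal sketch. This summary does not give the paper's exact formula, so the combination rule below is an assumption: it starts from the fully conditioned noise prediction and adds one guidance term per prompt, each weighted separately. The names `dual_cfg`, `toy_eps_model`, `w_desc`, and `w_cont` are hypothetical, and the toy noise model stands in for a real diffusion network.

```python
from typing import Callable, List, Optional, Sequence

# Hypothetical signature for a noise predictor: (latent, description condition,
# content condition) -> predicted noise. Passing None drops that condition.
NoiseModel = Callable[
    [Sequence[float], Optional[Sequence[float]], Optional[Sequence[float]]],
    List[float],
]


def dual_cfg(
    eps_model: NoiseModel,
    z_t: Sequence[float],
    c_desc: Sequence[float],
    c_cont: Sequence[float],
    w_desc: float,
    w_cont: float,
) -> List[float]:
    """Combine four noise predictions with separate guidance weights.

    Assumed rule (not confirmed by this summary):
        eps = eps(desc, cont)
              + w_desc * (eps(desc, null) - eps(null, null))
              + w_cont * (eps(null, cont) - eps(null, null))
    """
    eps_full = eps_model(z_t, c_desc, c_cont)  # both prompts
    eps_desc = eps_model(z_t, c_desc, None)    # description prompt only
    eps_cont = eps_model(z_t, None, c_cont)    # content prompt only
    eps_null = eps_model(z_t, None, None)      # unconditional
    return [
        f + w_desc * (d - n) + w_cont * (c - n)
        for f, d, c, n in zip(eps_full, eps_desc, eps_cont, eps_null)
    ]


def toy_eps_model(z_t, c_desc, c_cont):
    """Toy stand-in for a denoiser: a linear function of its inputs."""
    desc = c_desc if c_desc is not None else [0.0] * len(z_t)
    cont = c_cont if c_cont is not None else [0.0] * len(z_t)
    return [z + d + 2.0 * c for z, d, c in zip(z_t, desc, cont)]


guided = dual_cfg(toy_eps_model, [1.0], [0.5], [0.25], w_desc=2.0, w_cont=3.0)
unguided = dual_cfg(toy_eps_model, [1.0], [0.5], [0.25], w_desc=0.0, w_cont=0.0)
```

Keeping the two weights separate is what lets a user push harder on the environmental description or on the spoken content independently; with both weights at zero the sketch reduces to the plain conditional prediction.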
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis
Methods: Diffusion
