VoiceDiT: Dual-Condition Diffusion Transformer for Environment-Aware Speech Synthesis

Jaemin Jung; Junseok Ahn; Chaeyoung Jung; Tan Dat Nguyen; Youngjoon Jang; Joon Son Chung

arXiv:2412.19259·eess.AS·December 30, 2024

TL;DR

VoiceDiT is a multi-modal diffusion transformer that generates environment-aware speech and audio from text and visual prompts. It targets the problem of keeping speech aligned with its text in noisy conditions.

Contribution

The paper introduces VoiceDiT, a multi-modal generative model built around a dual-condition diffusion transformer (Dual-DiT), together with a large-scale synthetic speech dataset for pre-training and a refined real-world speech dataset for fine-tuning.

Findings

1. Outperforms previous models on real-world datasets

2. Achieves better audio quality and environmental sound alignment

3. Effectively integrates multi-modal prompts in speech synthesis

Abstract

We present VoiceDiT, a multi-modal generative model for producing environment-aware speech and audio from text and visual prompts. While aligning speech with text is crucial for intelligible speech, achieving this alignment in noisy conditions remains a significant and underexplored challenge in the field. To address this, we present a novel audio generation pipeline named VoiceDiT. This pipeline includes three key components: (1) the creation of a large-scale synthetic speech dataset for pre-training and a refined real-world speech dataset for fine-tuning, (2) the Dual-DiT, a model designed to efficiently preserve aligned speech information while accurately reflecting environmental conditions, and (3) a diffusion-based Image-to-Audio Translator that allows the model to bridge the gap between audio and image, facilitating the generation of environmental sound that aligns with the…

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Speech and dialogue systems