DualSpec: Text-to-spatial-audio Generation via Dual-Spectrogram Guided Diffusion Model
Lei Zhao, Sizhou Chen, Linfeng Feng, Jichao Zhang, Xiao-Lei Zhang, Chi Zhang, Xuelong Li

TL;DR
DualSpec is a novel framework that generates immersive spatial audio from text descriptions by combining dual spectrogram features and diffusion models, advancing text-to-audio synthesis with spatial accuracy.
Contribution
It introduces a dual-spectrogram guided diffusion approach for text-to-spatial-audio generation, improving synthesis quality and azimuth accuracy over prior monaural-focused methods.
Findings
High-quality spatial audio generation from text descriptions.
Improved azimuth accuracy in spatial sound synthesis.
Effective use of dual spectrogram features for better results.
Abstract
Text-to-audio (TTA) generation, which produces audio signals from textual descriptions, has received considerable attention in recent years. However, recent work has focused only on text-to-monaural audio. Spatial audio provides a more immersive auditory experience than monaural audio, for example in virtual reality. To address this gap, we propose a text-to-spatial-audio (TTSA) generation framework named DualSpec. Specifically, it first trains variational autoencoders (VAEs) to extract latent acoustic representations from sound event audio. Then, given text that describes sound events and event directions, the proposed method uses the encoder of a pretrained large language model to transform the text into text features. Finally, it trains a diffusion model on the latent acoustic representations and text features for spatial audio generation. In the inference stage, only the text…
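The three training stages named in the abstract (audio to latent, text to features, diffusion on the latent conditioned on the text) can be illustrated with a toy data-flow sketch. This is not the authors' implementation: the linear "VAE encoder", the hashed bag-of-words "text encoder", the linear noise predictor, the linear noise schedule, and every name and dimension below are hypothetical stand-ins chosen only to make the flow runnable with NumPy.

```python
import numpy as np

LATENT_DIM, TEXT_DIM, STEPS = 8, 16, 100
rng = np.random.default_rng(0)

# Linear DDPM-style noise schedule (an assumption; the source does not specify one).
betas = np.linspace(1e-4, 0.02, STEPS)
alphas_cumprod = np.cumprod(1.0 - betas)


def encode_audio(spec, w_enc):
    """Stand-in for a VAE encoder: project a flattened spectrogram to a latent vector."""
    return spec.reshape(-1) @ w_enc


def encode_text(prompt, dim=TEXT_DIM):
    """Stand-in for a pretrained LLM encoder: deterministic hashed bag-of-words features."""
    vec = np.zeros(dim)
    for token in prompt.lower().split():
        vec[sum(ord(ch) for ch in token) % dim] += 1.0
    return vec


def add_noise(z0, t, noise):
    """Forward diffusion: z_t = sqrt(a_bar_t) * z_0 + sqrt(1 - a_bar_t) * noise."""
    a_bar = alphas_cumprod[t]
    return np.sqrt(a_bar) * z0 + np.sqrt(1.0 - a_bar) * noise


def predict_noise(z_t, text_feat, t, w_denoise):
    """Stand-in denoiser: linear map over [noisy latent, text features, scaled timestep]."""
    x = np.concatenate([z_t, text_feat, [t / STEPS]])
    return x @ w_denoise


def diffusion_loss(z0, text_feat, t, w_denoise, noise):
    """Noise-prediction objective: mean squared error between predicted and true noise."""
    z_t = add_noise(z0, t, noise)
    pred = predict_noise(z_t, text_feat, t, w_denoise)
    return float(np.mean((pred - noise) ** 2))


# Toy stereo "spectrogram": 2 channels x 4 frequency bins x 5 frames.
spec = rng.standard_normal((2, 4, 5))
w_enc = rng.standard_normal((spec.size, LATENT_DIM)) * 0.1
w_denoise = rng.standard_normal((LATENT_DIM + TEXT_DIM + 1, LATENT_DIM)) * 0.1

z0 = encode_audio(spec, w_enc)                          # stage 1: audio -> latent
text_feat = encode_text("a dog barks from the left")    # stage 2: text -> features
noise = rng.standard_normal(LATENT_DIM)
loss = diffusion_loss(z0, text_feat, t=50, w_denoise=w_denoise, noise=noise)  # stage 3
print(f"latent shape: {z0.shape}, loss: {loss:.4f}")
```

In a real system each stand-in would be a trained network, and the loss would be minimized over many (audio, text) pairs and random timesteps; the sketch only shows which inputs feed which stage.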
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Phonetics and Phonology Research
Methods: Softmax · Attention Is All You Need · Diffusion
