DualSpec: Text-to-spatial-audio Generation via Dual-Spectrogram Guided Diffusion Model

Lei Zhao; Sizhou Chen; Linfeng Feng; Jichao Zhang; Xiao-Lei Zhang; Chi Zhang; Xuelong Li

arXiv:2502.18952 · cs.SD · June 9, 2025

TL;DR

DualSpec is a novel framework that generates immersive spatial audio from text descriptions by combining dual spectrogram features and diffusion models, advancing text-to-audio synthesis with spatial accuracy.

Contribution

It introduces a dual-spectrogram guided diffusion approach for text-to-spatial-audio generation, improving synthesis quality and azimuth accuracy over prior monaural-focused methods.

Findings

01 High-quality spatial audio generation from text descriptions.

02 Improved azimuth accuracy in spatial sound synthesis.

03 Effective use of dual spectrogram features for better results.

Abstract

Text-to-audio (TTA), which generates audio signals from textual descriptions, has received considerable attention in recent years. However, recent work has focused only on generating monaural audio from text. Spatial audio provides a more immersive auditory experience than monaural audio, for example in virtual reality. To address this gap, we propose a text-to-spatial-audio (TTSA) generation framework named DualSpec. Specifically, it first trains variational autoencoders (VAEs) to extract latent acoustic representations from sound event audio. Then, given text that describes sound events and their directions, the proposed method uses the encoder of a pretrained large language model to transform the text into text features. Finally, it trains a diffusion model on the latent acoustic representations and text features for spatial audio generation. In the inference stage, only the text…
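The training stage described in the abstract (a diffusion model trained on VAE latents and text features) can be illustrated with a minimal sketch. It assumes a standard DDPM-style noise-prediction objective, which the abstract does not confirm. The names `forward_noise`, `toy_denoiser`, and `noise_prediction_loss` are hypothetical, the random vectors stand in for real VAE latents and text-encoder outputs, and the "denoiser" is a trivial scalar model rather than the authors' network.

```python
import math
import random


def forward_noise(z0, alpha_bar_t, rng):
    """Noise a clean latent: z_t = sqrt(a) * z0 + sqrt(1 - a) * eps (assumed DDPM form)."""
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    a = math.sqrt(alpha_bar_t)
    b = math.sqrt(1.0 - alpha_bar_t)
    z_t = [a * z + b * e for z, e in zip(z0, eps)]
    return z_t, eps


def toy_denoiser(z_t, text_feat, w_z, w_c):
    """Hypothetical stand-in for a text-conditioned noise predictor.

    It mixes the noisy latent with the mean of the text features using two
    scalar weights; a real model would be a neural network.
    """
    cond = sum(text_feat) / len(text_feat)
    return [w_z * z + w_c * cond for z in z_t]


def noise_prediction_loss(z0, text_feat, alpha_bar_t, w_z, w_c, rng):
    """Mean squared error between predicted and true noise."""
    z_t, eps = forward_noise(z0, alpha_bar_t, rng)
    eps_hat = toy_denoiser(z_t, text_feat, w_z, w_c)
    return sum((p - e) ** 2 for p, e in zip(eps_hat, eps)) / len(eps)


rng = random.Random(0)
z0 = [rng.gauss(0.0, 1.0) for _ in range(8)]          # stand-in for a VAE latent
text_feat = [rng.gauss(0.0, 1.0) for _ in range(16)]  # stand-in for text features
loss = noise_prediction_loss(z0, text_feat, 0.5, 0.0, 0.0, rng)
print(f"toy loss: {loss:.4f}")
```

In this sketch the text features enter only through the denoiser, mirroring the abstract's description of conditioning generation on text; how DualSpec actually injects the conditioning is not stated in the text above.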

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Phonetics and Phonology Research

Methods: Softmax · Attention Is All You Need · Diffusion