AudioSpa: Spatializing Sound Events with Text
Linfeng Feng, Lei Zhao, Boyu Zhu, Xiao-Lei Zhang, and Xuelong Li

TL;DR
AudioSpa generates binaural spatial audio from a text description and a monaural reference recording, placing each sound event at the direction the text specifies. It relies on multimodal learning and data augmentation to localize sources accurately.
Contribution
This work introduces AudioSpa, the first end-to-end model for text-guided binaural audio generation with spatial localization, built on large language models and fusion multi-head attention.
Findings
Achieves accurate sound source localization from text descriptions.
Demonstrates competitive performance in localization accuracy.
Employs data augmentation to improve spatialization diversity.
Abstract
Text-to-audio (TTA) systems have recently demonstrated strong performance in synthesizing monaural audio from text. However, the task of generating binaural spatial audio from text, which provides a more immersive auditory experience by incorporating the sense of spatiality, has not been explored yet. In this work, we introduce text-guided binaural audio generation. As an early effort, we focus on the scenario where a monaural reference audio is given additionally. The core problem is to associate specific sound events with their directions, thereby creating binaural spatial audio. The challenge lies in the complexity of textual descriptions and the limited availability of single-source sound event datasets. To address this, we propose AudioSpa, an end-to-end model that applies large language models to process both acoustic and textual information. We employ fusion multi-head attention…
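To make "associating a sound event with a direction" concrete, the sketch below turns a mono signal into a left/right pair using two classic binaural cues: an interaural time difference (ITD) and an interaural level difference (ILD). This is not AudioSpa's method, which is a learned model. It is only a hand-written illustration of what a binaural output encodes. The function name `spatialize_mono`, the head radius, and the 6 dB maximum level difference are all assumptions made for this example.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s
HEAD_RADIUS = 0.0875    # m; assumed average head radius (not from the paper)
MAX_ILD_DB = 6.0        # assumed attenuation of the far ear at +/-90 degrees


def spatialize_mono(mono, sample_rate, azimuth_deg):
    """Return (left, right) sample lists for a mono signal placed at an azimuth.

    azimuth_deg: 0 is straight ahead, +90 is fully right, -90 is fully left.
    """
    if not -90.0 <= azimuth_deg <= 90.0:
        raise ValueError("azimuth_deg must be within [-90, 90]")

    theta = math.radians(abs(azimuth_deg))

    # ITD from Woodworth's spherical-head approximation.
    itd_seconds = HEAD_RADIUS / SPEED_OF_SOUND * (theta + math.sin(theta))
    delay = round(itd_seconds * sample_rate)  # delay in whole samples

    # ILD: attenuate the far ear, scaling smoothly with the azimuth.
    far_gain = 10.0 ** (-MAX_ILD_DB * math.sin(theta) / 20.0)

    # The near ear hears the signal first and at full level;
    # the far ear hears it later and quieter. Pad so both have equal length.
    near = list(mono) + [0.0] * delay
    far = [0.0] * delay + [far_gain * s for s in mono]

    # Positive azimuth means the source is on the right, so right is the near ear.
    return (far, near) if azimuth_deg >= 0 else (near, far)


left, right = spatialize_mono([1.0, 0.5, -0.5], sample_rate=16000, azimuth_deg=90.0)
```

With the source on the right, `right` starts immediately at full level while `left` is delayed and attenuated. A text-guided system has to infer the direction for each event from the description instead of receiving `azimuth_deg` directly.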
Taxonomy
Topics: Music and Audio Processing · Natural Language Processing Techniques · Speech Recognition and Synthesis
Methods: Attention Is All You Need · Softmax · Linear Layer · Multi-Head Attention · Focus
