AudioSpa: Spatializing Sound Events with Text

Linfeng Feng; Lei Zhao; Boyu Zhu; Xiao-Lei Zhang; and Xuelong Li

arXiv:2502.11219·eess.AS·February 18, 2025

Open Access

TL;DR

AudioSpa generates binaural spatial audio from a text description and a monaural reference recording, placing each sound event at the direction the text specifies. It relies on multimodal learning and data augmentation to localize sound sources accurately.

Contribution

This work introduces AudioSpa, the first end-to-end model for text-guided binaural audio generation with spatial localization capabilities, using large language models and fusion multi-head attention.

Findings

01 Achieves accurate sound source localization from text descriptions.

02 Demonstrates competitive performance in localization accuracy.

03 Employs data augmentation to improve spatialization diversity.

Abstract

Text-to-audio (TTA) systems have recently demonstrated strong performance in synthesizing monaural audio from text. However, the task of generating binaural spatial audio from text, which provides a more immersive auditory experience by incorporating the sense of spatiality, has not been explored yet. In this work, we introduce text-guided binaural audio generation. As an early effort, we focus on the scenario where a monaural reference audio is additionally given. The core problem is to associate specific sound events with their directions, thereby creating binaural spatial audio. The challenge lies in the complexity of textual descriptions and the limited availability of single-source sound event datasets. To address this, we propose AudioSpa, an end-to-end model that applies large language models to process both acoustic and textual information. We employ fusion multi-head attention…
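To make concrete what "associating a sound event with a direction" means at the signal level, the following is a minimal sketch of a classical, non-learned spatializer that turns a mono signal into left and right channels using an interaural time difference (Woodworth approximation) and a crude interaural level difference. It is not the AudioSpa model and is not taken from the paper; the function name `spatialize`, the azimuth convention, the head radius, and the 6 dB level cap are all illustrative assumptions.

```python
import math

SPEED_OF_SOUND = 343.0   # m/s (assumed room-temperature value)
HEAD_RADIUS = 0.0875     # m (assumed average head radius)


def spatialize(mono, azimuth_deg, sample_rate=16000):
    """Return (left, right) sample lists for a source at azimuth_deg.

    Assumed convention: 0 = front, +90 = fully right, -90 = fully left.
    This is a hypothetical helper for illustration only.
    """
    # Clamp to the frontal half-plane, where the approximation applies.
    theta = math.radians(max(-90.0, min(90.0, azimuth_deg)))

    # Woodworth approximation of the interaural time difference (seconds).
    itd = HEAD_RADIUS / SPEED_OF_SOUND * (abs(theta) + math.sin(abs(theta)))
    delay = int(round(itd * sample_rate))

    # Crude level difference: the far ear is attenuated by up to 6 dB.
    far_gain = 10 ** (-6.0 * abs(math.sin(theta)) / 20.0)

    # The near ear hears the signal first; the far ear hears it later and softer.
    near = list(mono) + [0.0] * delay
    far = [0.0] * delay + [far_gain * s for s in mono]

    if theta >= 0:           # source on the right: the right ear is the near ear
        return far, near
    return near, far
```

Calling `spatialize(samples, 90)` delays and attenuates the left channel relative to the right, while `spatialize(samples, 0)` returns two identical channels. A text-guided system such as the one described in the abstract has to infer, from the description, which event in the mixture should receive which direction; this sketch only shows the final rendering step for a single, already-isolated source.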

Taxonomy

Topics: Music and Audio Processing · Natural Language Processing Techniques · Speech Recognition and Synthesis

Methods: Attention Is All You Need · Softmax · Linear Layer · Multi-Head Attention · Focus