In-the-wild Audio Spatialization with Flexible Text-guided Localization
Tianrui Pan, Jie Liu, Zewen Huang, Jie Tang, Gangshan Wu

TL;DR
This paper introduces a flexible text-guided framework for audio spatialization in immersive environments, utilizing a large-scale dataset and a novel assessment model to improve spatial accuracy and semantic coherence.
Contribution
It presents the TAS framework with a new dataset and an evaluation method, enabling interactive and accurate binaural audio generation guided by text prompts.
Findings
Outperforms existing methods on simulated and real datasets
Demonstrates superior generalization and spatial accuracy
Achieves high semantic coherence with text prompts
Abstract
To enhance immersive experiences, binaural audio offers spatial awareness of sounding objects in AR, VR, and embodied AI applications. While existing audio spatialization methods can generally map any available monaural audio to binaural audio signals, they often lack the flexible and interactive control needed in complex multi-object, user-interactive environments. To address this, we propose a Text-guided Audio Spatialization (TAS) framework that is driven by flexible text prompts, and we evaluate our model from unified generation and comprehension perspectives. Because high-quality, large-scale stereo data is scarce, we construct the SpatialTAS dataset, which comprises 376,000 simulated binaural audio samples for training our model. Our model learns binaural differences guided by 3D spatial location and relative position prompts, augmented by flipped-channel…
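To make "binaural differences" concrete, the sketch below measures the two standard interaural cues on a stereo clip: the interaural level difference (ILD) and the interaural time difference (ITD). It is an illustration only and not the TAS model or its training objective. The function name `interaural_cues`, the broadband energy-ratio ILD, and the cross-correlation ITD estimate are all assumptions chosen for this example.

```python
import numpy as np


def interaural_cues(left, right, sample_rate):
    """Return (ILD in dB, ITD in seconds) for one binaural clip.

    Hypothetical helper for illustration; not taken from the paper.
    ILD > 0 means the left channel carries more energy.
    ITD > 0 means the left channel lags the right channel.
    """
    left = np.asarray(left, dtype=float)
    right = np.asarray(right, dtype=float)
    eps = 1e-12  # guards against log(0) and division by zero on silent input

    # ILD: ratio of the channel energies, expressed in decibels.
    ild_db = 10.0 * np.log10((np.sum(left ** 2) + eps) / (np.sum(right ** 2) + eps))

    # ITD: lag at which the cross-correlation of the two channels peaks.
    corr = np.correlate(left, right, mode="full")
    lag = int(np.argmax(corr)) - (len(right) - 1)
    itd_s = lag / sample_rate

    return ild_db, itd_s


if __name__ == "__main__":
    # Synthetic source placed to the listener's right: the left ear
    # receives a quieter copy that arrives 8 samples later.
    rng = np.random.default_rng(0)
    source = rng.standard_normal(1000)
    right = np.concatenate([source, np.zeros(8)])
    left = 0.5 * np.concatenate([np.zeros(8), source])
    print(interaural_cues(left, right, sample_rate=16000))
```

Swapping the two channels before calling the function negates both cues, which is the acoustic effect of mirroring a source from one side of the listener to the other.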
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis
