Both Ears Wide Open: Towards Language-Driven Spatial Audio Generation
Peiwen Sun, Sitong Cheng, Xiangtai Li, Zhen Ye, Huadai Liu, Honggang Zhang, Wei Xue, Yike Guo

TL;DR
This paper introduces a novel multimodal diffusion-based approach for controllable stereo spatial audio generation from text, images, and retrieval, addressing challenges of complexity, data costs, and stability.
Contribution
It presents the first large-scale, simulation-based dataset and a spatial-aware diffusion model that enables accurate, immersive, and controllable spatial audio synthesis from multiple modalities.
Findings
Effective spatial audio generation from text and images
Outperforms existing methods in realism and controllability
Demonstrates physical plausibility in generated spatial sound
Abstract
Recently, diffusion models have achieved great success in mono-channel audio generation. However, when it comes to stereo audio generation, the soundscapes often have a complex scene of multiple objects and directions. Controlling stereo audio with spatial contexts remains challenging due to high data costs and unstable generative models. To the best of our knowledge, this work represents the first attempt to address these issues. We first construct a large-scale, simulation-based, and GPT-assisted dataset, BEWO-1M, with abundant soundscapes and descriptions even including moving and multiple sources. Beyond text modality, we have also acquired a set of images and rationally paired stereo audios through retrieval to advance multimodal generation. Existing audio generation models tend to generate rather random and indistinct spatial audio. To provide accurate guidance for Latent…
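To make the idea of simulating a stereo scene from a mono source concrete, here is a minimal sketch in Python. It is not the BEWO-1M simulation pipeline, whose details are not given in this excerpt. It assumes a simple head model: the Woodworth approximation for the interaural time difference (ITD) and a made-up sine-shaped level difference capped at `max_ild_db`. The names `spatialize`, `estimate_itd_samples`, `HEAD_RADIUS`, and `max_ild_db` are hypothetical and chosen for illustration only.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed
HEAD_RADIUS = 0.0875    # m, assumed average head radius


def itd_seconds(azimuth_deg):
    """Woodworth approximation of the interaural time difference.

    Positive azimuth means the source is on the listener's right.
    """
    theta = np.deg2rad(azimuth_deg)
    return HEAD_RADIUS / SPEED_OF_SOUND * (theta + np.sin(theta))


def spatialize(mono, sr, azimuth_deg, max_ild_db=6.0):
    """Place a mono signal at an azimuth; returns an (n, 2) array [left, right].

    The far ear receives a delayed and attenuated copy of the signal.
    The level-difference curve is an illustrative assumption, not a measured HRTF.
    """
    delay = int(round(abs(itd_seconds(azimuth_deg)) * sr))
    ild_db = max_ild_db * abs(np.sin(np.deg2rad(azimuth_deg)))
    far_gain = 10.0 ** (-ild_db / 20.0)
    near = np.concatenate([mono, np.zeros(delay)])
    far = far_gain * np.concatenate([np.zeros(delay), mono])
    if azimuth_deg >= 0:   # source on the right: right ear is the near ear
        left, right = far, near
    else:                  # source on the left: left ear is the near ear
        left, right = near, far
    return np.stack([left, right], axis=1)


def estimate_itd_samples(stereo):
    """Lag (in samples) by which the left channel trails the right channel.

    A positive value suggests a source on the right, a negative value one on the left.
    """
    left, right = stereo[:, 0], stereo[:, 1]
    corr = np.correlate(left, right, mode="full")
    return int(np.argmax(corr)) - (len(right) - 1)
```

The same cross-correlation check can be run on generated audio as a rough sanity test of whether a prompt such as "on the left" produced a left-leaning signal, although it ignores reverberation and multiple sources.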
Peer Reviews
Decision: ICLR 2025 Spotlight
1. Innovative Approach: The paper proposes a unique one-stage generation framework that integrates multimodal inputs (text and images) with azimuth-aware guidance, enabling controllable stereo audio synthesis. 2. Significant Dataset Contribution: The BEWO-1M dataset is a notable contribution, providing a large-scale foundation for training and evaluating spatial audio models. Its combination of audio, captions, and simulated spatial attributes is a strong asset for advancing future research.
1. Dependence on Simulated Data: The BEWO-1M dataset, while substantial, is built largely on simulated and GPT-transformed data. This could limit the model's ability to generalize to real-world audio data, which typically exhibits greater variability. Can you show the results on any related real-world datasets, e.g., dataset in [1]? 2. Given the synthetic nature of BEWO-1M, how do you plan to adapt or extend the dataset to include more real-world audio data? Have you considered potential data
* The problem itself is a novel one, as I don't think I've seen other works tackle the spatial audio generation problem in the same modern setup, e.g. via LDM. * I think the dataset portion was the most interesting and seemingly rigorous part of the work and the spatial audio generation was the least. The dataset collection pipeline in general seemed like a sensible approach, short of collecting it manually, whereas the spatial audio generation portion felt like a fairly straightforward application
* I think the biggest challenge I had was that I was not qualitatively convinced of the quality from the demos. There were maybe only 2-3 samples in total where it felt like it was doing what was expected spatially; e.g., "A duck is quacking on the left." and "A dog is barking on the right." don't actually sound like they're coming from their respective directions. That said, the framing of the paper's subjective results is that it is better _relative_ to competing methods, which is possible, but i
1. Stereo audio generation is less studied than general mono-channel audio generation but is meaningful for immersive experiences, and is a good research topic. 2. The proposed dataset, BEWO-1M, is the first large-scale spatial audio dataset labeled through automatic pipelines; the construction pipeline sounds reasonable, and I believe the dataset could be helpful for future research in this field. 3. The proposed method is proven effective on both mono-channel benchmarks and proposed dual-ch
1. The descriptions of the dataset construction pipeline and model framework are not easy enough to follow.
Taxonomy
Topics: Music and Audio Processing · Music Technology and Sound Studies
Methods: Diffusion · Sparse Evolutionary Training
