Generating Moving 3D Soundscapes with Latent Diffusion Models
Christian Templin, Yanda Zhu, Hao Wang

TL;DR
SonicMotion is a latent diffusion framework that generates first-order Ambisonics (FOA) spatial audio with moving sound sources, offering explicit control over source motion and high localization accuracy for immersive applications.
Contribution
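The paper introduces what the authors describe as the first end-to-end model for generating FOA audio with moving sources, controllable through natural language and spatial trajectory parameters, and supports it with a new dataset of over one million simulated FOA caption pairs.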
Findings
Achieves state-of-the-art semantic alignment and perceptual quality.
Attains low spatial localization error.
Supports both descriptive and parametric control modes.
Abstract
Spatial audio has become central to immersive applications such as VR/AR, cinema, and music. Existing generative audio models are largely limited to mono or stereo formats and cannot capture the full 3D localization cues available in first-order Ambisonics (FOA). Recent FOA models extend text-to-audio generation but remain restricted to static sources. In this work, we introduce SonicMotion, the first end-to-end latent diffusion framework capable of generating FOA audio with explicit control over moving sound sources. SonicMotion is implemented in two variations: 1) a descriptive model conditioned on natural language prompts, and 2) a parametric model conditioned on both text and spatial trajectory parameters for higher precision. To support training and evaluation, we construct a new dataset of over one million simulated FOA caption pairs that include both static and dynamic sources…
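To make the idea of FOA audio with a moving source concrete, the following minimal sketch pans a mono signal along a linear azimuth sweep into four FOA channels. It assumes the AmbiX convention (ACN channel order W, Y, Z, X with SN3D gains), which the abstract does not specify. The function names `encode_moving_source_foa` and `estimate_azimuth_deg` are hypothetical. The sketch only illustrates the audio format and is not the SonicMotion model or the authors' simulation pipeline.

```python
import math


def encode_moving_source_foa(mono, az_start_deg, az_end_deg, elevation_deg=0.0):
    """Pan a mono signal along a linear azimuth sweep into four FOA channels.

    Assumption: AmbiX convention (channel order W, Y, Z, X; SN3D gains),
    azimuth measured counter-clockwise from the front.
    """
    n = len(mono)
    el = math.radians(elevation_deg)
    w, y, z, x = [], [], [], []
    for i, s in enumerate(mono):
        # Linearly interpolate the azimuth from start to end over the clip.
        t = i / (n - 1) if n > 1 else 0.0
        az = math.radians(az_start_deg + t * (az_end_deg - az_start_deg))
        w.append(s)                                  # omnidirectional
        y.append(s * math.sin(az) * math.cos(el))    # left-right
        z.append(s * math.sin(el))                   # up-down
        x.append(s * math.cos(az) * math.cos(el))    # front-back
    return w, y, z, x


def estimate_azimuth_deg(w, y, x):
    """Estimate a single azimuth from time-averaged W*X and W*Y products."""
    ix = sum(a * b for a, b in zip(w, x))
    iy = sum(a * b for a, b in zip(w, y))
    return math.degrees(math.atan2(iy, ix))


if __name__ == "__main__":
    # A short 440 Hz tone at 16 kHz, moving from the front (0 deg) to the left (90 deg).
    sr = 16000
    tone = [math.sin(2 * math.pi * 440 * k / sr) for k in range(sr // 10)]
    w, y, z, x = encode_moving_source_foa(tone, 0.0, 90.0)
    print(len(w), len(y), len(z), len(x))
```

Applying `estimate_azimuth_deg` to short windows of the encoded channels gives a rough per-window direction, which is one simple way to check whether a generated clip follows a requested trajectory.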
Taxonomy
Topics: Noise Effects and Management · Music and Audio Processing · Vehicle Noise and Vibration Control
