SpatialV2A: Visual-Guided High-fidelity Spatial Audio Generation
Yanan Wang, Linjie Ren, Zihao Li, Junyi Wang, Tian Gan

TL;DR
SpatialV2A is a framework for generating spatially accurate, immersive audio from visual cues. It is supported by a new large-scale binaural dataset, BinauralVGGSound, and targets improved spatial fidelity in video-to-audio synthesis.
Contribution
The paper presents BinauralVGGSound, described as the first large-scale video-binaural audio dataset for spatial audio generation, and an end-to-end visual-guided spatial audio synthesis framework that explicitly models spatial features.
Findings
Outperforms state-of-the-art methods in spatial fidelity
Enhances immersive auditory experience
Maintains semantic and temporal alignment
Abstract
While video-to-audio generation has achieved remarkable progress in semantic and temporal alignment, most existing studies focus solely on these aspects, paying limited attention to the spatial perception and immersive quality of the synthesized audio. This limitation stems largely from current models' reliance on mono audio datasets, which lack the binaural spatial information needed to learn visual-to-spatial audio mappings. To address this gap, we introduce two key contributions: we construct BinauralVGGSound, the first large-scale video-binaural audio dataset designed to support spatially aware video-to-audio generation; and we propose an end-to-end spatial audio generation framework guided by visual cues, which explicitly models spatial features. Our framework incorporates a visual-guided audio spatialization module that ensures the generated audio exhibits realistic spatial…
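To make concrete what "binaural spatial information" a mono recording lacks, the sketch below estimates the two classic interaural cues from a two-channel clip: the interaural level difference (ILD) and the interaural time difference (ITD). This is a generic illustration, not the paper's spatialization module or evaluation metric; the function name `interaural_cues`, the `max_lag` parameter, and the synthetic test signal are all assumptions made for the example.

```python
import math
import random


def interaural_cues(left, right, sample_rate, max_lag=64):
    """Estimate (ILD in dB, ITD in seconds) for a two-channel clip.

    Hypothetical helper for illustration only; not taken from the paper.
    ILD > 0 means the left channel carries more energy.
    ITD > 0 means the left channel lags (the sound reached the right ear first).
    """
    eps = 1e-12  # avoids log(0) and division by zero on silent channels
    energy_left = sum(x * x for x in left)
    energy_right = sum(x * x for x in right)
    ild_db = 10.0 * math.log10((energy_left + eps) / (energy_right + eps))

    # ITD: the lag (in samples) that maximizes the cross-correlation
    # sum_i left[i + lag] * right[i] over the overlapping samples.
    n = min(len(left), len(right))
    best_lag, best_score = 0, -math.inf
    for lag in range(-max_lag, max_lag + 1):
        start = max(0, -lag)
        stop = min(n, n - lag)
        score = sum(left[i + lag] * right[i] for i in range(start, stop))
        if score > best_score:
            best_lag, best_score = lag, score
    return ild_db, best_lag / sample_rate


# Synthetic check: a noise source placed toward the right ear, so the
# left channel is the same signal delayed by 8 samples and halved in amplitude.
rng = random.Random(0)
source = [rng.gauss(0.0, 1.0) for _ in range(800)]
right = source
left = [0.0] * 8 + [0.5 * x for x in source[:-8]]

ild_db, itd_s = interaural_cues(left, right, sample_rate=16000)
print(f"ILD = {ild_db:.2f} dB, ITD = {itd_s * 1e3:.3f} ms")
```

On this synthetic clip the estimate recovers the 8-sample delay (0.5 ms at 16 kHz) and an ILD close to -6 dB. A mono track yields identical channels, so both cues collapse to zero, which is why a mono-only dataset gives a model nothing to learn these cues from.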
Taxonomy
Topics: Music Technology and Sound Studies · Generative Adversarial Networks and Image Synthesis · Speech and Audio Processing
