SpatialV2A: Visual-Guided High-fidelity Spatial Audio Generation

Yanan Wang; Linjie Ren; Zihao Li; Junyi Wang; Tian Gan

arXiv:2601.15017·cs.CV·January 30, 2026

Open Access

TL;DR

SpatialV2A introduces a novel framework for generating spatially accurate, immersive audio from visual cues, supported by a new large-scale binaural dataset, significantly enhancing spatial fidelity in video-to-audio synthesis.

Contribution

The paper presents BinauralVGGSound, the first large-scale video-binaural audio dataset for spatial audio generation, and an end-to-end visual-guided spatial audio synthesis framework that explicitly models spatial features.

Findings

01. Outperforms state-of-the-art methods in spatial fidelity

02. Enhances the immersive auditory experience

03. Maintains semantic and temporal alignment

Abstract

While video-to-audio generation has achieved remarkable progress in semantic and temporal alignment, most existing studies focus solely on these aspects, paying limited attention to the spatial perception and immersive quality of the synthesized audio. This limitation stems largely from current models' reliance on mono audio datasets, which lack the binaural spatial information needed to learn visual-to-spatial audio mappings. To address this gap, we make two key contributions: we construct BinauralVGGSound, the first large-scale video-binaural audio dataset designed to support spatially aware video-to-audio generation; and we propose an end-to-end spatial audio generation framework guided by visual cues, which explicitly models spatial features. Our framework incorporates a visual-guided audio spatialization module that ensures the generated audio exhibits realistic spatial…
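To make concrete what "binaural spatial information" means and why a mono recording lacks it, the following is a minimal sketch of two standard interaural cues: the interaural level difference (ILD) and the interaural time difference (ITD). It is an illustration only, not the authors' pipeline or spatialization module. The function name `interaural_cues`, the 16 kHz sample rate, and the synthetic noise signal are all assumptions made for the example.

```python
import numpy as np


def interaural_cues(left, right, sample_rate):
    """Return (ILD in dB, ITD in seconds) for a two-channel signal.

    Hypothetical helper for illustration; not taken from the paper.
    ILD > 0 means the left channel carries more energy.
    ITD > 0 means the left channel lags the right channel.
    """
    eps = 1e-12  # guards against log(0) and division by zero on silence
    left_energy = np.sum(left ** 2) + eps
    right_energy = np.sum(right ** 2) + eps
    ild_db = 10.0 * np.log10(left_energy / right_energy)

    # The lag that maximizes the cross-correlation estimates the delay
    # of `left` relative to `right`, in samples.
    corr = np.correlate(left, right, mode="full")
    lag = int(np.argmax(corr)) - (len(right) - 1)
    itd_s = lag / sample_rate
    return ild_db, itd_s


sample_rate = 16_000  # assumed value for the demo
rng = np.random.default_rng(0)
source = rng.standard_normal(1024)

# Mono audio duplicated to two channels: both ears receive the same signal.
mono_ild, mono_itd = interaural_cues(source, source, sample_rate)

# A crude "source on the right": the left ear hears it later and quieter.
delay = 8  # samples
right = np.concatenate([source, np.zeros(delay)])
left = 0.5 * np.concatenate([np.zeros(delay), source])
bin_ild, bin_itd = interaural_cues(left, right, sample_rate)

print("mono:     ILD = %.2f dB, ITD = %.6f s" % (mono_ild, mono_itd))
print("binaural: ILD = %.2f dB, ITD = %.6f s" % (bin_ild, bin_itd))
```

For the duplicated mono signal both cues are zero, so there is nothing for a model to learn about source position. The two-channel example carries a nonzero level and time difference, which is the kind of information a binaural dataset would supply.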

Taxonomy

Topics: Music Technology and Sound Studies · Generative Adversarial Networks and Image Synthesis · Speech and Audio Processing