SpA2V: Harnessing Spatial Auditory Cues for Audio-driven Spatially-aware Video Generation

Kien T. Pham; Yingqing He; Yazhou Xing; Qifeng Chen; Long Chen

arXiv:2508.00782 · cs.GR · March 17, 2026

TL;DR

SpA2V is a framework that uses spatial auditory cues in the input audio to generate videos whose semantic content and spatial positioning both match that audio.

Contribution

This work is the first to explicitly exploit spatial auditory cues for high-quality, spatially-aware video generation from audio inputs.

Findings

01 Outperforms existing methods in semantic and spatial alignment

02 Generates realistic videos with high fidelity to input audio cues

03 Operates in a training-free manner using pre-trained diffusion models

Abstract

Audio-driven video generation aims to synthesize realistic videos that align with input audio recordings, akin to the human ability to visualize scenes from auditory input. However, existing approaches predominantly focus on semantic information, such as the classes of sounding sources present in the audio, which limits their ability to generate videos with accurate content and spatial composition. In contrast, humans can not only identify the semantic categories of sounding sources but also determine their deeply encoded spatial attributes, including locations and movement directions. This information can be recovered from specific spatial indicators derived from the inherent physical properties of sound, such as loudness or frequency. As prior methods largely ignore this factor, we present SpA2V, the first framework that explicitly exploits these…
