SpA2V: Harnessing Spatial Auditory Cues for Audio-driven Spatially-aware Video Generation
Kien T. Pham, Yingqing He, Yazhou Xing, Qifeng Chen, Long Chen

TL;DR
SpA2V is a framework that leverages spatial auditory cues in the input audio to generate videos with accurate semantic content and spatial positioning, improving realism and audio-video alignment in audio-driven video synthesis.
Contribution
This work is the first to explicitly exploit spatial auditory cues for high-quality, spatially-aware video generation from audio inputs.
Findings
Outperforms existing methods in semantic and spatial alignment
Generates realistic videos with high fidelity to input audio cues
Operates in a training-free manner using pre-trained diffusion models
Abstract
Audio-driven video generation aims to synthesize realistic videos that align with input audio recordings, akin to the human ability to visualize scenes from auditory input. However, existing approaches predominantly focus on exploring semantic information, such as the classes of sounding sources present in the audio, limiting their ability to generate videos with accurate content and spatial composition. In contrast, we humans can not only naturally identify the semantic categories of sounding sources but also determine their deeply encoded spatial attributes, including locations and movement directions. This useful information can be elucidated by considering specific spatial indicators derived from the inherent physical properties of sound, such as loudness or frequency. As prior methods largely ignore this factor, we present SpA2V, the first framework that explicitly exploits these…
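As an illustration of what a loudness-based spatial indicator can look like, the sketch below computes an inter-channel level difference (in dB) from a stereo signal and maps it to a coarse left/center/right label. This is not taken from the paper: the function names (`rms`, `level_difference_db`, `coarse_side`), the 1 dB threshold, and the synthetic test tone are all assumptions chosen for the example, and the actual SpA2V pipeline may derive its cues quite differently.

```python
import math

def rms(samples):
    """Root-mean-square amplitude of a sequence of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def level_difference_db(left, right, eps=1e-12):
    """Inter-channel level difference in dB (hypothetical helper).

    Positive values mean the left channel is louder than the right.
    eps guards against division by zero and log of zero on silent channels.
    """
    return 20.0 * math.log10((rms(left) + eps) / (rms(right) + eps))

def coarse_side(left, right, threshold_db=1.0):
    """Map the level difference to a coarse side label.

    The 1 dB threshold is an arbitrary assumption for illustration.
    """
    ild = level_difference_db(left, right)
    if ild > threshold_db:
        return "left"
    if ild < -threshold_db:
        return "right"
    return "center"

# Synthetic one-second 440 Hz tone, twice as loud in the left channel.
sample_rate = 16000
tone = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(sample_rate)]
left = [1.0 * s for s in tone]
right = [0.5 * s for s in tone]

print(round(level_difference_db(left, right), 2))  # amplitude ratio 2 gives about 6.02 dB
print(coarse_side(left, right))
```

Computing the same quantity over short sliding windows would, under the same assumptions, give a rough trace of how a source moves between the two channels over time.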
