ViSAudio: End-to-End Video-Driven Binaural Spatial Audio Generation

Mengchen Zhang; Qi Chen; Tong Wu; Zihan Liu; Dahua Lin

arXiv:2512.03036 · cs.CV · December 3, 2025

TL;DR

ViSAudio is an end-to-end framework that generates spatially immersive binaural audio directly from silent videos, improving realism and consistency over previous two-stage methods.

Contribution

The paper introduces the task of end-to-end video-driven binaural audio generation and provides BiAudio, a dataset of approximately 97K video-binaural audio pairs, to support it.

Findings

01 Outperforms existing methods in objective metrics

02 Produces high-quality, spatially immersive audio

03 Adapts effectively to viewpoint and sound-source changes

Abstract

Despite progress in video-to-audio generation, the field focuses predominantly on mono output, lacking spatial immersion. Existing binaural approaches remain constrained by a two-stage pipeline that first generates mono audio and then performs spatialization, often resulting in error accumulation and spatio-temporal inconsistencies. To address this limitation, we introduce the task of end-to-end binaural spatial audio generation directly from silent video. To support this task, we present the BiAudio dataset, comprising approximately 97K video-binaural audio pairs spanning diverse real-world scenes and camera rotation trajectories, constructed through a semi-automated pipeline. Furthermore, we propose ViSAudio, an end-to-end framework that employs conditional flow matching with a dual-branch audio generation architecture, where two dedicated branches model the audio latent flows.…
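
As a rough illustration of the conditional flow matching objective mentioned in the abstract, the sketch below builds training targets for two audio latents (left and right) using a straight-line path from noise to data. It is not the ViSAudio implementation: the linear path, the function names `cfm_pair` and `dual_branch_cfm_loss`, the latent shapes, the `cond` placeholder, and the `predict_velocity` interface are all assumptions made for illustration.

```python
import numpy as np

def cfm_pair(x1, rng):
    """Sample (t, x_t, target velocity) on a straight path from noise to x1."""
    x0 = rng.standard_normal(x1.shape)  # Gaussian noise endpoint
    t = rng.uniform()                   # random time in [0, 1)
    xt = (1.0 - t) * x0 + t * x1        # point on the linear path
    v = x1 - x0                         # constant velocity along that path
    return t, xt, v

def dual_branch_cfm_loss(predict_velocity, left, right, cond, rng):
    """Mean squared velocity error summed over two hypothetical branches."""
    loss = 0.0
    for branch, x1 in (("left", left), ("right", right)):
        t, xt, v = cfm_pair(x1, rng)
        pred = predict_velocity(branch, xt, t, cond)  # assumed model interface
        loss += float(np.mean((pred - v) ** 2))
    return loss

rng = np.random.default_rng(0)
left = rng.standard_normal((4, 8))    # toy latent: 4 frames x 8 channels
right = rng.standard_normal((4, 8))
cond = np.zeros(16)                   # placeholder for video features

# A stand-in "model" that always predicts zero velocity.
zero_model = lambda branch, xt, t, cond: np.zeros_like(xt)
loss = dual_branch_cfm_loss(zero_model, left, right, cond, rng)
```

With a trained velocity model, audio latents would be produced by integrating the predicted velocity from noise at t = 0 to t = 1; that sampling step is omitted here.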

Taxonomy

Topics: Music Technology and Sound Studies · Speech and Audio Processing · Hearing Loss and Rehabilitation