AV-NeRF: Learning Neural Fields for Real-World Audio-Visual Scene Synthesis
Susan Liang, Chao Huang, Yapeng Tian, Anurag Kumar, Chenliang Xu

TL;DR
This paper introduces AV-NeRF, a neural field approach for synthesizing realistic, spatially consistent audio and video at novel positions and view directions within a recorded scene. It relies on geometry-aware audio generation and is accompanied by a new dataset.
Contribution
The paper presents a first-of-its-kind NeRF-based method for real-world audio-visual scene synthesis. The method integrates prior knowledge of audio propagation and models the acoustic field relative to the sound source.
Findings
Effective synthesis of novel view and position videos with matching spatial audio.
Successful application on real-world and simulated datasets.
Improved realism and spatial consistency in audio-visual scene generation.
Abstract
Can machines recording an audio-visual scene produce realistic, matching audio-visual experiences at novel positions and novel view directions? We answer it by studying a new task -- real-world audio-visual scene synthesis -- and a first-of-its-kind NeRF-based approach for multimodal learning. Concretely, given a video recording of an audio-visual scene, the task is to synthesize new videos with spatial audios along arbitrary novel camera trajectories in that scene. We propose an acoustic-aware audio generation module that integrates prior knowledge of audio propagation into NeRF, in which we implicitly associate audio generation with the 3D geometry and material properties of a visual environment. Furthermore, we present a coordinate transformation module that expresses a view direction relative to the sound source, enabling the model to learn sound source-centric acoustic fields. To…
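The abstract's coordinate transformation idea, expressing a view direction relative to the sound source, can be illustrated with a minimal 2D sketch. This is not the authors' implementation: the function name `source_relative_angle`, the planar (x, y) setup, and the yaw-angle convention are all assumptions made here for illustration.

```python
import math


def source_relative_angle(listener_xy, view_yaw, source_xy):
    """Hypothetical helper: angle of the sound source relative to the view direction.

    listener_xy, source_xy: (x, y) positions in a shared world frame (assumed 2D).
    view_yaw: listener's view direction as a world-frame angle in radians.
    Returns an angle in [-pi, pi); 0 means the listener faces the source.
    """
    dx = source_xy[0] - listener_xy[0]
    dy = source_xy[1] - listener_xy[1]
    # World-frame bearing from the listener to the source.
    bearing = math.atan2(dy, dx)
    # Re-express that bearing relative to where the listener is looking.
    relative = bearing - view_yaw
    # Wrap into [-pi, pi) so equivalent poses map to the same value.
    return (relative + math.pi) % (2 * math.pi) - math.pi


if __name__ == "__main__":
    # Listener at the origin looking along +x, source straight ahead.
    print(source_relative_angle((0.0, 0.0), 0.0, (1.0, 0.0)))
    # Same listener, source directly to the left (+y).
    print(source_relative_angle((0.0, 0.0), 0.0, (0.0, 1.0)))
```

The point of such a transformation is that two camera poses with the same geometry relative to the source produce the same input, regardless of their absolute orientation in the world frame.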
Taxonomy
Topics: Music and Audio Processing · Generative Adversarial Networks and Image Synthesis · Advanced Vision and Imaging
