NeRAF: 3D Scene Infused Neural Radiance and Acoustic Fields

Amandine Brunetto; Sascha Hornauer; Fabien Moutarde

arXiv:2405.18213·cs.SD·October 3, 2025

TL;DR

NeRAF is a novel method that jointly learns 3D visual and acoustic fields, enabling high-quality, spatialized audio and visual scene synthesis with improved efficiency and versatility.

Contribution

It introduces a joint learning framework for acoustic and radiance fields, enhancing audio-visual scene synthesis and view generation from sparse data.

Findings

01

Achieves significant performance improvements over prior methods on the SoundSpaces and RAF datasets, while being more data-efficient.

02

Generates high-quality, spatialized audio and visual scenes.

03

Enhances novel view synthesis with cross-modal learning.

Abstract

Sound plays a major role in human perception. Along with vision, it provides essential information for understanding our surroundings. Despite advances in neural implicit representations, learning acoustics that align with visual scenes remains a challenge. We propose NeRAF, a method that jointly learns acoustic and radiance fields. NeRAF synthesizes both novel views and spatialized room impulse responses (RIR) at new positions by conditioning the acoustic field on 3D scene geometric and appearance priors from the radiance field. The generated RIR can be applied to auralize any audio signal. Each modality can be rendered independently and at spatially distinct positions, offering greater versatility. We demonstrate that NeRAF generates high-quality audio on SoundSpaces and RAF datasets, achieving significant performance improvements over prior methods while being more data-efficient.…
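The abstract notes that a generated RIR can be used to auralize any audio signal. In standard acoustics this means convolving the dry (anechoic) signal with the impulse response, once per output channel. The sketch below illustrates only that step, assuming a mono input and a binaural RIR stored as a `(channels, taps)` array. The `auralize` helper and the toy RIR values are hypothetical and are not taken from the NeRAF code.

```python
import numpy as np


def auralize(dry: np.ndarray, rir: np.ndarray) -> np.ndarray:
    """Convolve a mono dry signal with a (channels, taps) RIR.

    Hypothetical helper for illustration; returns an array of shape
    (channels, len(dry) + taps - 1).
    """
    rir = np.atleast_2d(rir)  # accept a single-channel RIR as well
    return np.stack([np.convolve(dry, h, mode="full") for h in rir])


# Toy binaural RIR (made-up values): a direct path plus one echo per ear.
rir = np.zeros((2, 8))
rir[0, 0], rir[0, 3] = 1.0, 0.5    # left ear
rir[1, 1], rir[1, 5] = 0.8, 0.25   # right ear

# A unit impulse as the dry signal: the output reproduces the RIR itself.
dry = np.array([1.0, 0.0, 0.0, 0.0])
wet = auralize(dry, rir)
```

With real recordings, `dry` would be any audio clip sampled at the same rate as the RIR. For long signals an FFT-based convolution is usually faster than direct convolution.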

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01·Rating 5·Confidence 4

Strengths

1. **Enhanced Realism**: NeRAF uniquely integrates visual and acoustic information, using visual scene features to inform the acoustic model and vice versa. This cross-modal approach improves the fidelity of both visual and acoustic outputs, leading to more realistic audio-visual experiences without the need for additional aligned datasets.
2. **Data Versatility**: Unlike other methods requiring dense, aligned audio-visual data, NeRAF operates effectively with sparse datasets and does not rely

Weaknesses

1. **Problem Setting Correctness**: In Section 3.4, the authors write that both the audio source and the two microphones (two ears) have directions. If so, this contradicts the claim that sound propagation is omnidirectional (L205), because sound propagation from a directional source isn't uniformly directional and thus may not be measured by an RIR (the situation becomes more complex if the microphones are also directional). Moreover, the RIRs provided by the SoundSpaces 1.0 data are emitted by point source w

Reviewer 02·Rating 5·Confidence 5

Strengths

+ The paper addresses a very interesting topic: joint learning of visual and acoustic rendering for 3D scenes. The authors propose an effective pipeline to solve this challenging task and explore how the two modalities can be combined.
+ Thorough experiments are conducted on both synthetic and real datasets, comparing with reliable baselines and plenty of ablations to show the effectiveness of the proposed modules. Especially for the grid sampling visualization, from which we can see what

Weaknesses

- Weak cross-modality learning. The proposed cross-modal approach for acoustic learning uses ResNet3D to extract 3D geometric and appearance priors and then injects them into acoustic rendering. However, I doubt it for two reasons:
  - The model is scene-dependent, meaning one model is fit per scene. In this case, it's hard to guide the ResNet3D to truly focus on properties that affect acoustic learning and physically correspond to geometry/materials etc., especially without any

Reviewer 03·Rating 6·Confidence 4

Strengths

- In general, I like the proposed multi-modal joint training approach that is appealing for its concise design and its potential to enhance the generated quality of both visual and acoustic fields. The flexibility to leverage single modalities when needed is also an advantage.
- The paper is generally straightforward to follow. Extensive studies have been conducted on both synthetic and real datasets, along with healthy supplementary materials, code, and a demo video.

Weaknesses

- The task of rendering both visual and acoustic signals has been previously explored (e.g., AV-NeRF), thus limiting the novelty of this work. From a technical perspective, the primary distinctions between NeRAF and AV-NeRF lie in 1) the use of 3D scene features for RIR generation and 2) the joint training pipeline. However, these technical contributions are not particularly significant to me.
- More analyses on cross-modal learning are needed. Table 2, Figure 6, Appendices I and J show that jo

Videos

NeRAF: 3D Scene Infused Neural Radiance and Acoustic Fields·SlidesLive

Taxonomy

Topics·Neural dynamics and brain function

Methods·ALIGN