NeRAF: 3D Scene Infused Neural Radiance and Acoustic Fields
Amandine Brunetto, Sascha Hornauer, Fabien Moutarde

TL;DR
NeRAF is a novel method that jointly learns 3D visual and acoustic fields, enabling high-quality, spatialized audio and visual scene synthesis with improved efficiency and versatility.
Contribution
It introduces a joint learning framework for acoustic and radiance fields, enhancing audio-visual scene synthesis and view generation from sparse data.
Findings
Achieves significant audio-quality improvements over prior methods on the SoundSpaces and RAF datasets while being more data-efficient.
Generates high-quality, spatialized audio and visual scenes.
Enhances novel view synthesis with cross-modal learning.
Abstract
Sound plays a major role in human perception. Along with vision, it provides essential information for understanding our surroundings. Despite advances in neural implicit representations, learning acoustics that align with visual scenes remains a challenge. We propose NeRAF, a method that jointly learns acoustic and radiance fields. NeRAF synthesizes both novel views and spatialized room impulse responses (RIR) at new positions by conditioning the acoustic field on 3D scene geometric and appearance priors from the radiance field. The generated RIR can be applied to auralize any audio signal. Each modality can be rendered independently and at spatially distinct positions, offering greater versatility. We demonstrate that NeRAF generates high-quality audio on SoundSpaces and RAF datasets, achieving significant performance improvements over prior methods while being more data-efficient.…
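The abstract notes that a generated RIR can be applied to auralize any audio signal. A minimal sketch of that step, assuming the usual approach of convolving a dry mono signal with each channel of a (possibly binaural) RIR, might look like the following. The function name `auralize` and the toy two-channel RIR are illustrative assumptions and are not taken from the NeRAF code.

```python
import numpy as np


def auralize(dry, rir):
    """Convolve a dry mono signal with each channel of an RIR.

    dry: array of shape (num_samples,)
    rir: array of shape (num_channels, rir_len), or (rir_len,) for mono
    Returns an array of shape (num_channels, num_samples + rir_len - 1).
    """
    dry = np.asarray(dry, dtype=np.float64)
    rir = np.atleast_2d(np.asarray(rir, dtype=np.float64))
    # One full linear convolution per RIR channel (e.g. left and right ear).
    return np.stack([np.convolve(dry, channel) for channel in rir])


# Toy input (hypothetical values): 0.1 s of a 440 Hz tone at 16 kHz.
sample_rate = 16000
t = np.arange(sample_rate // 10) / sample_rate
dry = np.sin(2 * np.pi * 440 * t)

# Toy two-channel RIR: left ear gets the direct sound only,
# right ear gets it 8 samples later at half amplitude.
rir = np.zeros((2, 64))
rir[0, 0] = 1.0
rir[1, 8] = 0.5

wet = auralize(dry, rir)
```

In practice the RIR would come from the trained acoustic field rather than being hand-built, and an FFT-based convolution would be preferable for long impulse responses.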
Peer Reviews
Decision·ICLR 2025 Poster
1. **Enhanced Realism**: NeRAF uniquely integrates visual and acoustic information, using visual scene features to inform the acoustic model and vice versa. This cross-modal approach improves the fidelity of both visual and acoustic outputs, leading to more realistic audio-visual experiences without the need for additional aligned datasets.
2. **Data Efficiency and Versatility**: Unlike other methods requiring dense, aligned audio-visual data, NeRAF operates effectively with sparse datasets and does not rely
1. **Problem Setting Correctness**: In Section 3.4, the authors write that both the audio source and the two microphones (two ears) have directions. If so, this contradicts the claim that sound propagation is omnidirectional (L205), because sound propagating from a directional source is not uniform across directions and thus may not be measured by an RIR (the situation becomes more complex if the microphones are also directional). Moreover, the RIRs provided by SoundSpaces 1.0 are emitted by point source w
+ The paper addresses a very interesting topic: joint learning of visual and acoustic rendering for 3D scenes. The authors propose an effective pipeline for this challenging task and explore how the two modalities can be combined.
+ Thorough experiments are conducted on both synthetic and real datasets, with comparisons against reliable baselines and plenty of ablations to show the effectiveness of the proposed modules. Especially for the grid sampling visualization, from which we can see what
- Weak cross-modality learning. The proposed approach to cross-modality learning for acoustics is to use ResNet3D to extract 3D geometric and appearance priors and then inject them into acoustic rendering. However, I doubt its effectiveness for two reasons:
  - The model is scene-dependent, meaning one model is fit per scene. In this case, it's hard to guide the ResNet3D to truly focus on properties that affect acoustic learning and physically correspond to geometry/materials etc., especially without any
- In general, I like the proposed multi-modal joint training approach, which is appealing for its concise design and its potential to enhance the generated quality of both visual and acoustic fields. The flexibility to leverage single modalities when needed is also an advantage.
- The paper is generally straightforward to follow. Extensive studies have been conducted on both synthetic and real datasets, along with healthy supplementary materials, code, and a demo video.
- The task of rendering both visual and acoustic signals has been previously explored (e.g., AV-NeRF), which limits the novelty of this work. From a technical perspective, the primary distinctions between NeRAF and AV-NeRF lie in 1) the use of 3D scene features for RIR generation and 2) the joint training pipeline. However, these technical contributions are not particularly significant to me.
- More analyses on cross-modal learning are needed. Table 2, Figure 6, and Appendices I and J show that jo
Taxonomy
Topics: Neural dynamics and brain function
Methods: ALIGN
