${M^2D}$NeRF: Multi-Modal Decomposition NeRF with 3D Feature Fields
Ning Wang, Lefei Zhang, Angel X Chang

TL;DR
This paper introduces ${M^2D}$NeRF, a novel neural field model that integrates multi-modal features from visual and language models to enable semantic scene decomposition and editing in 3D.
Contribution
The work presents a unified model that incorporates multi-modal feature distillation and novel loss functions for improved 3D scene decomposition and editing capabilities.
Findings
Outperforms prior NeRF methods in scene decomposition tasks.
Enables consistent text-based and visual patch-based 3D edits.
Achieves more precise object boundaries in 3D scenes.
Abstract
Neural fields (NeRF) have emerged as a promising approach for representing continuous 3D scenes. Nevertheless, the lack of semantic encoding in NeRFs poses a significant challenge for scene decomposition. To address this challenge, we present a single model, Multi-Modal Decomposition NeRF (${M^2D}$NeRF), that is capable of both text-based and visual patch-based edits. Specifically, we use multi-modal feature distillation to integrate teacher features from pretrained visual and language models into 3D semantic feature volumes, thereby facilitating consistent 3D editing. To enforce consistency between the visual and language features in our 3D feature volumes, we introduce a multi-modal similarity constraint. We also introduce a patch-based joint contrastive loss that encourages object regions to coalesce in the 3D feature space, resulting in more precise boundaries. Experiments…
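The abstract names feature distillation and a multi-modal similarity constraint but does not give their formulas. The following is a minimal sketch of one plausible reading, not the authors' implementation: distillation is assumed to be a squared-error match between rendered and teacher features, and the similarity constraint is assumed to match pairwise cosine-similarity matrices of the visual and language features. All function names, the threshold value, and the exact loss forms are hypothetical.

```python
import numpy as np

EPS = 1e-8  # guards against division by zero when normalizing


def cosine_similarity_matrix(feats):
    """Pairwise cosine similarity between rows of an (N, D) feature array."""
    normed = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + EPS)
    return normed @ normed.T


def distillation_loss(rendered, teacher):
    """Assumed form: mean squared L2 distance between rendered and teacher features."""
    return float(np.mean(np.sum((rendered - teacher) ** 2, axis=1)))


def similarity_consistency_loss(visual, language):
    """Assumed form: the two modalities should agree on which points are alike.

    The feature dimensions may differ, so the N x N similarity structure is
    compared rather than the raw features.
    """
    sim_v = cosine_similarity_matrix(visual)
    sim_l = cosine_similarity_matrix(language)
    return float(np.mean((sim_v - sim_l) ** 2))


def text_query_mask(language, text_embedding, threshold=0.5):
    """Select points whose language feature is close to a text embedding.

    The threshold is an illustrative value, not one taken from the paper.
    """
    feats = language / (np.linalg.norm(language, axis=1, keepdims=True) + EPS)
    query = text_embedding / (np.linalg.norm(text_embedding) + EPS)
    return feats @ query > threshold
```

In this reading, a text-based edit would first compute a mask such as `text_query_mask(language_feats, text_embedding)` and then apply the edit only to the selected points. Comparing similarity matrices instead of raw features is one way to relate two feature spaces of different dimensionality.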
Taxonomy
Topics: Seismic Imaging and Inversion Techniques · Computer Graphics and Visualization Techniques · Electromagnetic Scattering and Analysis
