TL;DR
This paper introduces a view-consistent sampling method for NeRF training that leverages distributional regularization based on features from foundation models, improving 3D scene reconstruction especially in outdoor scenes.
Contribution
It proposes a novel view-consistent distribution sampling approach combined with a depth-pushing loss to enhance NeRF training without relying on explicit depth supervision.
Findings
Significantly improves novel view synthesis quality.
Outperforms state-of-the-art NeRF variants and depth regularization methods.
Effective in outdoor unbounded scenes.
Abstract
Neural Radiance Fields (NeRF) has emerged as a compelling framework for scene representation and 3D recovery. To improve its performance on real-world data, depth regularization has proven to be the most effective approach. However, depth estimation models not only require expensive 3D supervision in training, but also suffer from generalization issues. As a result, the depth estimates can be erroneous in practice, especially for outdoor unbounded scenes. In this paper, we propose to employ view-consistent distributions instead of fixed depth value estimates to regularize NeRF training. Specifically, the distribution is computed by utilizing both low-level color features and high-level distilled features from foundation models at the 2D pixel locations obtained by projecting the 3D points sampled along each ray. By sampling from the view-consistent distributions, an implicit regularization is imposed…
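The abstract's idea of scoring per-ray samples by cross-view feature agreement can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the pinhole camera model, the nearest-neighbour lookup, the cosine-similarity score, the softmax temperature, and the categorical inverse-CDF sampler are all assumptions chosen for illustration. The paper's actual combination of color features and distilled features is not reproduced here.

```python
import numpy as np

def project(points, K, w2c):
    """Project (N, 3) world points with an assumed pinhole camera."""
    cam = points @ w2c[:3, :3].T + w2c[:3, 3]        # world -> camera frame
    z = cam[:, 2]
    uvh = cam @ K.T                                   # homogeneous pixel coords
    uv = uvh[:, :2] / np.maximum(z, 1e-8)[:, None]    # guard against z <= 0
    return uv, z

def lookup(feat_map, uv):
    """Nearest-neighbour feature lookup in an (H, W, C) feature map."""
    h, w, _ = feat_map.shape
    u = np.clip(np.round(uv[:, 0]), 0, w - 1).astype(int)
    v = np.clip(np.round(uv[:, 1]), 0, h - 1).astype(int)
    return feat_map[v, u]

def view_consistency_weights(points, ref_feat, src_maps, K, src_w2cs,
                             temperature=0.1):
    """Softmax over the mean cosine similarity between the reference
    feature and the features found at each sample's projections.
    (Hypothetical scoring rule, assumed for this sketch.)"""
    ref = ref_feat / (np.linalg.norm(ref_feat) + 1e-8)
    scores = np.zeros(len(points))
    for feat_map, w2c in zip(src_maps, src_w2cs):
        h, w, _ = feat_map.shape
        uv, z = project(points, K, w2c)
        f = lookup(feat_map, uv)
        f = f / (np.linalg.norm(f, axis=1, keepdims=True) + 1e-8)
        # Samples behind the camera or outside the image get a neutral score.
        valid = ((z > 0) & (uv[:, 0] >= 0) & (uv[:, 0] <= w - 1)
                 & (uv[:, 1] >= 0) & (uv[:, 1] <= h - 1))
        scores += np.where(valid, f @ ref, 0.0)
    scores /= len(src_maps)
    logits = scores / temperature
    logits -= logits.max()                            # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

def sample_depths(t_vals, weights, n, rng):
    """Draw n depths from the categorical distribution over t_vals
    by inverting its cumulative distribution."""
    cdf = np.cumsum(weights)
    cdf /= cdf[-1]
    idx = np.searchsorted(cdf, rng.random(n), side="right")
    idx = np.clip(idx, 0, len(t_vals) - 1)
    return np.sort(t_vals[idx])
```

In this sketch, samples whose projections land on matching features in the source views receive most of the probability mass, so resampled depths concentrate near the likely surface without any explicit depth supervision. A trained pipeline would presumably use differentiable bilinear lookups and learned features instead of the hard-coded choices above.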
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
The paper is well-written and easy to understand. The motivation is reasonable and straightforward. The authors conduct comprehensive experiments on NeRF to demonstrate its effectiveness.
1. The proposed method cannot be applied to 3D Gaussian splatting, which is much faster in training and rendering compared with NeRF. The authors should discuss how their approach might be adapted to 3DGS or explain why they still chose to focus on NeRF.
2. Using high-level features and low-level RGBs to calculate the view-consistency metrics is not novel. Many former works on learning-based feature matching have explored this before. For example, [1] SuperGlue: Learning Feature Matching with G
- The proposed approach is both interesting and methodologically sound.
- The paper is well-written, with a clear and well-motivated idea.
- Experimental results demonstrate superior performance compared to existing methods.
- According to Table 1, the proposed method appears to significantly increase the training time.
- There is a typographical error: duplicated "the" in Line 270.
1. The paper is well-organized and easy to follow.
2. Visualizations are clear and effective, aiding in understanding the concepts and improvements in visual quality.
3. Results are evaluated on two datasets with varying numbers of input images, consistently showing that the proposed method outperforms previous NeRF baselines.
4. The approach of determining surface points through feature similarity across multiple views is intuitive and promising.
1. In the original NeRF framework, a coarse MLP is used to estimate the density of sampled points for importance sampling. The proposed method replaces this with the feature similarity metric for weight computation, which, while effective, may not be as novel as claimed. It acts as an alternative rather than a completely new sampling method.
2. The proposed depth-pushing loss is conceptually similar to the distortion loss in MipNeRF360. A more detailed comparison and discussion of these approac
- The quantitative results in Table 4 and the qualitative results in Figure 4 clearly demonstrate the proposed method’s improvements.
- The two novel components -- view-consistent sampling and depth-pushing loss -- are sound and well-motivated.
- Overall, the paper is well-written.
- The proposed regularizations are compatible with existing methods, benefiting the field of NeRF-based models.
- Since the paper proposes an efficient ray sampling technique, it should compare to other existing efficient sampling techniques, such as Coarse-to-Fine Online Distillation in Mip-NeRF360 (CVPR '22) and Probabilistic Ray Sampling in SceneRF (ICCV '23).
- The impact of the distilled features' quality on rendering performance is unclear. Since these features are learned to match points across images, the paper should analyze the quantitative performance of point matching and its influence on dow
