GenNBV: Generalizable Next-Best-View Policy for Active 3D Reconstruction
Xiao Chen, Quanyi Li, Tai Wang, Tianfan Xue, Jiangmiao Pang

TL;DR
GenNBV introduces a reinforcement learning-based, generalizable next-best-view policy for active 3D reconstruction, capable of handling unseen geometries and large action spaces, significantly improving coverage on new datasets.
Contribution
The paper presents GenNBV, a novel RL-based NBV policy with a multi-source state embedding. The policy generalizes across datasets and extends the action space to 5D for active 3D scanning.
Findings
Achieves over 98% coverage on unseen datasets.
Outperforms prior NBV methods in generalization.
Effective in large-scale, unseen geometries.
Abstract
While recent advances in neural radiance field enable realistic digitization for large-scale scenes, the image-capturing process is still time-consuming and labor-intensive. Previous works attempt to automate this process using the Next-Best-View (NBV) policy for active 3D reconstruction. However, the existing NBV policies heavily rely on hand-crafted criteria, limited action space, or per-scene optimized representations. These constraints limit their cross-dataset generalizability. To overcome them, we propose GenNBV, an end-to-end generalizable NBV policy. Our policy adopts a reinforcement learning (RL)-based framework and extends typical limited action space to 5D free space. It empowers our agent drone to scan from any viewpoint, and even interact with unseen geometries during training. To boost the cross-dataset generalizability, we also propose a novel multi-source state…
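To make the "5D free space" action concrete, here is a minimal sketch of how a policy output could be mapped to a camera viewpoint. The abstract does not say what the five dimensions are; the split into a 3D position plus yaw and pitch, the normalized `[-1, 1]` action range, the workspace bounds, and every name below are assumptions made for illustration, not the authors' implementation.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Viewpoint5D:
    """Hypothetical 5D viewpoint: 3D position plus two rotation angles."""
    x: float
    y: float
    z: float
    yaw: float    # radians, rotation about the vertical axis (assumed)
    pitch: float  # radians, elevation of the viewing direction (assumed)


def clamp(value, low, high):
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(high, value))


def action_to_viewpoint(action, pos_bounds, pitch_limit=math.pi / 2):
    """Map a raw action in [-1, 1]^5 to a viewpoint inside the workspace.

    pos_bounds is a list of three (low, high) pairs, one per position axis.
    """
    if len(action) != 5:
        raise ValueError("expected a 5D action")
    a = [clamp(v, -1.0, 1.0) for v in action]
    # Rescale the first three components from [-1, 1] to the position bounds.
    pos = [lo + (v + 1.0) / 2.0 * (hi - lo)
           for v, (lo, hi) in zip(a[:3], pos_bounds)]
    yaw = a[3] * math.pi
    pitch = a[4] * pitch_limit
    return Viewpoint5D(pos[0], pos[1], pos[2], yaw, pitch)


def view_direction(vp):
    """Unit viewing direction implied by yaw and pitch."""
    cp = math.cos(vp.pitch)
    return (cp * math.cos(vp.yaw), cp * math.sin(vp.yaw), math.sin(vp.pitch))
```

For example, `action_to_viewpoint([0, 0, 0, 0, 0], [(-10, 10), (-10, 10), (0, 20)])` places the camera at the center of the assumed workspace, looking along the positive x axis. Out-of-range actions are clamped rather than rejected, which is one common design choice for continuous action spaces.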
Peer Reviews
Decision: ICLR 2024 Conference Withdrawn Submission
### Improved generalization ability for NBV learning
The authors have proposed an RL policy that trains from diverse 3D object simulations for NBV. With the trained policy, the authors have shown that it can be generalized to a new dataset with real-world collection.
### Ablation studies to support the embedding strategy
The authors have provided an ablation study to show the effectiveness of the proposed embedding strategies. The result shows that the proposed multimodal representation helps
### Missing related work
NBV is not a task addressed only by learning-based methods. This paper mainly discusses recent works that use the radiance field as the mapping module. However, many works use classical methods for NBV, e.g. [A][B]. I would suggest the authors refer to [C] for a more detailed survey of this field.
[A] A comparison of volumetric information gain metrics for active 3d object reconstruction
[B] An information gain formulation for active volumetric 3d reconstruction
- The overall paper is well motivated, well written, and straightforward to follow.
- Experiments on the Houses3K and OmniObject3D datasets show a performance gain over recent active-reconstruction baselines.
Though GenNBV achieves promising generalization capability on novel synthetic scenes, I still have a few concerns about its evaluation and practicability.
1. The experiments are limited to a synthetic setup, which is reasonable considering the RL pipeline. However, there are no practical demonstrations of how this would transfer to real-world reconstruction, where the agent dynamics, the captured RGB-D frames, and the poses will be imperfect and subject to physical limitations. How does th
- The paper formulates the next-best-view task as a reinforcement learning problem by defining state, action, and reward.
- An RL framework specifically tailored to the NBV problem is introduced. It takes images and actions as input to train a policy network that predicts the next best view for 3D reconstruction.
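The state/action/reward framing mentioned in this review can be sketched as a toy episode loop. Everything here is a simplifying assumption for illustration: the action is an index into a fixed set of candidate views (the paper's action space is continuous and 5D), the state is just the set of voxels seen so far, and the reward is the increase in coverage after each step. `ToyNBVEnv` and its parameters are hypothetical names, not part of GenNBV.

```python
class ToyNBVEnv:
    """Toy next-best-view episode over a fixed set of candidate views.

    view_to_voxels maps a view index to the set of surface voxel ids that
    the view reveals (assumed known here; a real system would render them).
    """

    def __init__(self, view_to_voxels, total_voxels, max_steps=10,
                 target_coverage=1.0):
        self.view_to_voxels = view_to_voxels
        self.total_voxels = total_voxels
        self.max_steps = max_steps
        self.target_coverage = target_coverage
        self.covered = set()
        self.steps = 0

    def reset(self):
        """Start a new episode with nothing observed."""
        self.covered = set()
        self.steps = 0
        return frozenset(self.covered)

    def coverage(self):
        """Fraction of surface voxels observed so far."""
        return len(self.covered) / self.total_voxels

    def step(self, action):
        """Capture one view; the reward is the resulting coverage gain."""
        before = self.coverage()
        self.covered |= self.view_to_voxels[action]
        self.steps += 1
        reward = self.coverage() - before
        done = (self.steps >= self.max_steps
                or self.coverage() >= self.target_coverage)
        return frozenset(self.covered), reward, done
```

With a coverage-gain reward, a view that only re-observes known voxels earns zero, so a policy maximizing return is pushed toward viewpoints that reveal new surface.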
- Limited Technical Novelty - The paper's technical novelty appears constrained, primarily focusing on presenting NBV as a reinforcement learning task. Notably, this is not the first work to do so: Scan-RL previously introduced an RL-based approach for NBV. Further, the proposed RL framework does not markedly deviate from Scan-RL's approach, with the primary distinction being the geometric representation derived from depth maps.
- Insufficient Experimental Evidence - The paper lac
The results in Tab. 2 are interesting. How different factors in the unimodal and multimodal settings affect the final results is worth studying with in-depth analysis.
1. Confusing setting.
   - The title indicates that the proposed method is designed for '3D reconstruction'. However, the evaluation metrics only evaluate the completeness/coverage ratio but not the reconstruction accuracy. It is unclear how the reconstructed scenes are visualized in Fig. 1-4 and the supplementary video. The grid size of the map is also unclear.
   - It is unclear why the paper mentions and compares against the NeRF-based methods as the proposed method utilizes conventional voxel grid
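The reviewer's point about coverage ratio and grid size can be illustrated with a small sketch of a voxel-based coverage metric. This is an assumed, generic formulation (fraction of ground-truth surface voxels that contain at least one observed point), not the paper's evaluation code; `voxelize`, `coverage_ratio`, and `voxel_size` are hypothetical names.

```python
import math


def voxelize(points, voxel_size):
    """Map 3D points to the set of integer voxel indices that contain them."""
    return {tuple(math.floor(c / voxel_size) for c in p) for p in points}


def coverage_ratio(observed_points, surface_points, voxel_size):
    """Fraction of ground-truth surface voxels hit by an observed point."""
    gt_voxels = voxelize(surface_points, voxel_size)
    if not gt_voxels:
        raise ValueError("surface_points must not be empty")
    # Observed points that fall outside the surface voxels are ignored.
    seen = voxelize(observed_points, voxel_size) & gt_voxels
    return len(seen) / len(gt_voxels)


# Four surface samples along the x axis, two of them observed, one outlier.
surface = [(0.1, 0.1, 0.1), (1.2, 0.1, 0.1), (2.5, 0.1, 0.1), (3.5, 0.0, 0.0)]
observed = [(0.4, 0.2, 0.3), (1.9, 0.9, 0.9), (9.0, 9.0, 9.0)]

fine = coverage_ratio(observed, surface, voxel_size=1.0)
coarse = coverage_ratio(observed, surface, voxel_size=4.0)
```

In this toy case the same observations give a coverage of 0.5 at a 1.0 voxel size and 1.0 at a 4.0 voxel size, which is why a reported coverage figure is hard to interpret without the grid resolution. The metric also says nothing about how accurately points are placed inside a voxel, so it measures completeness rather than reconstruction accuracy.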
Taxonomy
Topics: Medical Image Segmentation Techniques · 3D Shape Modeling and Analysis · Medical Imaging Techniques and Applications
