TL;DR
VoRTX is a transformer-based method for volumetric 3D reconstruction that learns how to fuse multi-view information. It preserves fine detail, handles occlusions, and outperforms existing methods across multiple datasets.
Contribution
The paper presents VoRTX, a transformer-based network for multi-view 3D reconstruction. It learns a view-fusion function conditioned on camera pose and image content, which retains more detail than global averaging without restricting which views are considered jointly.
Findings
Outperforms state-of-the-art methods on ScanNet, TUM-RGBD, and ICL-NUIM datasets.
Effectively handles occlusions by predicting initial scene geometry.
Generalizes across datasets without fine-tuning.
Abstract
Recent volumetric 3D reconstruction methods can produce very accurate results, with plausible geometry even for unobserved surfaces. However, they face an undesirable trade-off when it comes to multi-view fusion. They can fuse all available view information by global averaging, thus losing fine detail, or they can heuristically cluster views for local fusion, thus restricting their ability to consider all views jointly. Our key insight is that greater detail can be retained without restricting view diversity by learning a view-fusion function conditioned on camera pose and image content. We propose to learn this multi-view fusion using a transformer. To this end, we introduce VoRTX, an end-to-end volumetric 3D reconstruction network using transformers for wide-baseline, multi-view feature fusion. Our model is occlusion-aware, leveraging the transformer architecture to predict an…
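To make the contrast in the abstract concrete, the sketch below compares global averaging with attention-based fusion of per-view features at a single voxel. It is a minimal illustration and not the VoRTX architecture: the function names, the single-head attention, the additive pose encoding, and the random projection matrices `w_q`, `w_k`, `w_v` are all assumptions made for this example.

```python
import numpy as np

def mean_fusion(view_feats):
    """Global averaging: every view contributes equally to the voxel feature."""
    return view_feats.mean(axis=0)

def attention_fusion(view_feats, pose_enc, w_q, w_k, w_v):
    """Single-head self-attention across the views of one voxel.

    view_feats: (V, C) image features back-projected from V views
    pose_enc:   (V, C) hypothetical per-view camera-pose encoding
    w_q, w_k, w_v: (C, C) projection matrices (learned in a real model)
    """
    # Each view's token carries both image content and camera pose.
    tokens = view_feats + pose_enc
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    # Scaled dot-product scores between every pair of views: (V, V).
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Numerically stable softmax over the key dimension.
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    attended = weights @ v  # (V, C): each view re-expressed using all views
    # Pool the attended views into one voxel feature (an assumed choice).
    return attended.mean(axis=0), weights

# Toy data: 5 views, 8 feature channels, random stand-ins for learned weights.
rng = np.random.default_rng(0)
V, C = 5, 8
feats = rng.normal(size=(V, C))
pose = rng.normal(size=(V, C))
w_q, w_k, w_v = (rng.normal(size=(C, C)) * 0.1 for _ in range(3))

fused_mean = mean_fusion(feats)
fused_attn, weights = attention_fusion(feats, pose, w_q, w_k, w_v)
print(fused_mean.shape, fused_attn.shape, weights.shape)
```

In this sketch the attention weights depend on both the features and the pose encoding, so different views can contribute unequally. Setting the query projection to zero makes the weights uniform, which reduces the attention step to plain averaging of the value vectors.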
