VPFusion: Joint 3D Volume and Pixel-Aligned Feature Fusion for Single and Multi-view 3D Reconstruction
Jisan Mahmud, Jan-Michael Frahm

TL;DR
VPFusion is a unified neural implicit 3D reconstruction framework that combines 3D feature volumes with pixel-aligned image features, utilizing transformer-based multi-view fusion for improved accuracy in single and multi-view settings.
Contribution
It introduces a novel transformer-based pairwise view association architecture for multi-view feature fusion in 3D reconstruction.
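To make the idea of pairwise view association concrete, the following is a minimal, dependency-free sketch of self-attention applied across views at a single spatial location, followed by mean pooling. It is not the authors' architecture: the function names, the identity query/key/value projections, and the toy feature values are all assumptions chosen for brevity, and the interleaved 3D reasoning described in the abstract is omitted.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pairwise_view_attention(view_feats):
    """Self-attention across views for one spatial location.

    view_feats: list of V feature vectors (one per view), each of length C.
    Assumption: identity query/key/value projections instead of learned ones.
    """
    channels = len(view_feats[0])
    scale = math.sqrt(channels)
    attended = []
    for query in view_feats:
        # Every view scores itself against every view, so each output
        # already depends on all views before any pooling happens.
        weights = softmax([dot(query, key) / scale for key in view_feats])
        attended.append(
            [sum(w * v[c] for w, v in zip(weights, view_feats))
             for c in range(channels)]
        )
    return attended


def mean_pool(view_feats):
    """Average the per-view features into one fused feature."""
    n = len(view_feats)
    return [sum(f[c] for f in view_feats) / n
            for c in range(len(view_feats[0]))]


# Hypothetical toy features: 3 views, 2 channels each.
views = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = pairwise_view_attention(views)
fused = mean_pool(attended)

# Reversing the view order leaves the fused feature unchanged
# (up to floating-point rounding).
fused_reversed = mean_pool(pairwise_view_attention(views[::-1]))
```

Because the attention weights are computed over all pairs of views and the final pooling is symmetric, the fused feature does not depend on the order in which views are supplied, which is the property the abstract contrasts with RNN-based fusion.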
Findings
Outperforms existing methods on ShapeNet and ModelNet datasets.
Achieves higher reconstruction quality with combined 3D and pixel-aligned features.
Demonstrates the effectiveness of transformer-based multi-view fusion.
Abstract
We introduce VPFusion, a unified single and multi-view neural implicit 3D reconstruction framework. VPFusion attains high-quality reconstruction using both a 3D feature volume, which captures 3D-structure-aware context, and pixel-aligned image features, which capture fine local detail. Existing approaches use RNNs, feature pooling, or attention computed independently in each view for multi-view fusion. RNNs suffer from long-term memory loss and permutation variance, while feature pooling or independently computed attention leaves the representation in each view unaware of the other views until the final pooling step. In contrast, we show improved multi-view feature fusion by establishing transformer-based pairwise view association. In particular, we propose a novel interleaved 3D reasoning and pairwise view association architecture for feature volume fusion across different views. Using this…
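The combination of a 3D feature volume with pixel-aligned image features can be illustrated with a small sketch that builds the feature for one 3D query point. Everything here is an assumption for illustration: the helper names, the pinhole camera placed 2 units along the z-axis, the nearest-neighbour lookups (practical systems typically interpolate), and the toy feature values. How VPFusion actually samples and decodes features is not specified in the text above.

```python
def project(cam_point, focal, cx, cy):
    """Pinhole projection of a camera-space point (x, y, z) to pixel coords."""
    x, y, z = cam_point
    return focal * x / z + cx, focal * y / z + cy


def sample_image_feature(feat_map, u, v):
    """Nearest-neighbour lookup in an H x W x C feature map (nested lists)."""
    height, width = len(feat_map), len(feat_map[0])
    col = min(max(int(round(u)), 0), width - 1)
    row = min(max(int(round(v)), 0), height - 1)
    return feat_map[row][col]


def sample_volume_feature(volume, point, lo, hi):
    """Nearest-neighbour lookup in a D x D x D x C volume covering [lo, hi]^3."""
    depth = len(volume)
    idx = []
    for coord in point:
        t = (coord - lo) / (hi - lo)  # normalise to [0, 1]
        idx.append(min(max(int(t * depth), 0), depth - 1))
    i, j, k = idx
    return volume[i][j][k]


# Hypothetical toy inputs, one channel each.
feat_map = [[[0.0], [1.0]],
            [[2.0], [3.0]]]                       # 2 x 2 image features
volume = [[[[float(4 * i + 2 * j + k)] for k in range(2)]
           for j in range(2)] for i in range(2)]  # 2 x 2 x 2 volume features

world_point = (0.5, -0.5, 0.0)
# Assumed extrinsics: the camera frame is the world frame shifted by 2 in z.
cam_point = (world_point[0], world_point[1], world_point[2] + 2.0)

u, v = project(cam_point, focal=1.0, cx=0.5, cy=0.5)
pixel_feature = sample_image_feature(feat_map, u, v)
volume_feature = sample_volume_feature(volume, world_point, lo=-1.0, hi=1.0)

# The concatenated feature is what an implicit decoder would consume
# to predict a quantity such as occupancy at the query point.
query_feature = volume_feature + pixel_feature
```

The volume feature carries coarse context about where the point sits in 3D, while the pixel-aligned feature carries detail from the image location the point projects to; concatenation is one simple way to hand both to a decoder.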
Taxonomy
Topics: 3D Shape Modeling and Analysis · Advanced Vision and Imaging · Generative Adversarial Networks and Image Synthesis
