TL;DR
Refine3DNet introduces a hybrid CNN-transformer model with a novel training algorithm to improve 3D object reconstruction accuracy from multi-view RGB images, outperforming existing methods on ShapeNet datasets.
Contribution
The paper presents a hybrid CNN-transformer architecture, trained with a Joint Train Separate Optimization (JTSO) algorithm, for 3D reconstruction from single or multiple RGB images.
Findings
Outperforms state-of-the-art methods on the ShapeNet datasets in both single-view and multi-view 3D reconstruction.
Achieves 4.2% higher IoU in single-view reconstruction.
Works with both single and multiple input views taken from arbitrary viewpoints.
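The IoU figure above refers to intersection over union between a predicted and a ground-truth voxel grid. A minimal sketch of that metric follows, assuming flattened occupancy grids and a binarization threshold of 0.5. The function name `voxel_iou`, the threshold, and the empty-grid convention are illustrative assumptions, not details taken from the paper.

```python
from typing import Sequence


def voxel_iou(pred: Sequence[float], target: Sequence[int], threshold: float = 0.5) -> float:
    """IoU between a predicted occupancy grid and a binary ground-truth grid.

    Both grids are given flattened. `threshold` is an assumed cutoff for
    turning predicted occupancy probabilities into occupied/empty voxels.
    """
    if len(pred) != len(target):
        raise ValueError("grids must contain the same number of voxels")

    pred_occ = [p > threshold for p in pred]
    target_occ = [bool(t) for t in target]

    # Voxels occupied in both grids vs. voxels occupied in at least one grid.
    intersection = sum(p and t for p, t in zip(pred_occ, target_occ))
    union = sum(p or t for p, t in zip(pred_occ, target_occ))

    # Assumed convention: two empty grids count as a perfect match.
    return intersection / union if union else 1.0


if __name__ == "__main__":
    predicted = [0.9, 0.8, 0.2, 0.6]
    ground_truth = [1, 1, 0, 0]
    print(voxel_iou(predicted, ground_truth))
```

In this toy case three voxels are occupied in at least one grid and two in both, so the score is 2/3.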
Abstract
Generating 3D models from multi-view 2D RGB images has gained significant attention, extending the capabilities of technologies like Virtual Reality, Robotic Vision, and human-machine interaction. In this paper, we introduce a hybrid strategy combining CNNs and transformers, featuring a visual auto-encoder with self-attention mechanisms and a 3D refiner network, trained using a novel Joint Train Separate Optimization (JTSO) algorithm. Encoded features from unordered inputs are transformed into an enhanced feature map by the self-attention layer, decoded into an initial 3D volume, and further refined. Our network generates 3D voxels from single or multiple 2D images from arbitrary viewpoints. Performance evaluations using the ShapeNet datasets show that our approach, combined with JTSO, outperforms state-of-the-art techniques in single and multi-view 3D reconstruction, achieving the…
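The abstract notes that features from unordered input views pass through a self-attention layer. The sketch below illustrates why self-attention suits unordered inputs: followed by a symmetric pooling step, the result does not depend on view order. It is a toy illustration under stated assumptions, not the paper's network. The identity query/key/value projections, the mean pooling, and the names `self_attention` and `aggregate_views` are all hypothetical.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(features: List[Vector]) -> List[Vector]:
    """Scaled dot-product self-attention over per-view feature vectors.

    Assumption: queries, keys, and values are the raw features (identity
    projections); a trained model would use learned projection matrices.
    """
    dim = len(features[0])
    attended = []
    for query in features:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in features
        ]
        weights = softmax(scores)
        attended.append(
            [sum(w * value[i] for w, value in zip(weights, features)) for i in range(dim)]
        )
    return attended


def aggregate_views(features: List[Vector]) -> Vector:
    """Attend across views, then mean-pool (assumed) into one feature vector."""
    attended = self_attention(features)
    count = len(attended)
    return [sum(vec[i] for vec in attended) / count for i in range(len(attended[0]))]


if __name__ == "__main__":
    views = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
    print(aggregate_views(views))
    print(aggregate_views(list(reversed(views))))  # same up to rounding
```

The same function handles a single view, in which case attention reduces to the identity and the input feature is returned unchanged.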
