Multi-view 3D Reconstruction with Transformer

Dan Wang; Xinrui Cui; Xun Chen; Zhengxia Zou; Tianyang Shi; Septimiu Salcudean; Z. Jane Wang; Rabab Ward

arXiv:2103.12957·cs.CV·March 25, 2021

TL;DR

This paper introduces the 3D Volume Transformer (VolT), a Transformer-based framework that unifies feature extraction and view fusion for multi-view 3D object reconstruction, achieving state-of-the-art accuracy with fewer parameters.

Contribution

It reformulates multi-view 3D reconstruction as a sequence prediction task using Transformers, exploring view relationships more effectively than CNN-based methods.

Findings

01 Achieves new state-of-the-art accuracy on the ShapeNet dataset.

02 Uses 70% fewer parameters than CNN-based methods.

03 Demonstrates strong scalability of the proposed method.

Abstract

Deep CNN-based methods have so far achieved state-of-the-art results in multi-view 3D object reconstruction. Despite considerable progress, the two core modules of these methods, multi-view feature extraction and fusion, are usually investigated separately, and the object relations across different views are rarely explored. In this paper, inspired by the recent success of self-attention-based Transformer models, we reformulate multi-view 3D reconstruction as a sequence-to-sequence prediction problem and propose a new framework named 3D Volume Transformer (VolT) for this task. Unlike previous CNN-based methods, which use a separate design, we unify feature extraction and view fusion in a single Transformer network. A natural advantage of our design lies in the exploration of view-to-view relationships using self-attention among multiple unordered inputs. On ShapeNet - a…
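The idea of self-attention over multiple unordered views can be illustrated with a small sketch. This is generic single-head scaled dot-product attention, not the VolT architecture itself; the token size, the number of views, the random weights, and the helper names (matmul, softmax, self_attention) are all assumptions made for illustration. With no position encoding applied, reordering the input views only reorders the output rows, so the fusion does not depend on view order.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention (no position encoding)."""
    q, k, v = matmul(tokens, w_q), matmul(tokens, w_k), matmul(tokens, w_v)
    scale = math.sqrt(len(k[0]))
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / scale for s in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]  # each view attends to all views
    return matmul(weights, v)


def random_matrix(rng, rows, cols):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
num_views, dim = 4, 8  # hypothetical sizes, not taken from the paper

# One feature token per input view (random stand-ins for image features).
view_tokens = random_matrix(rng, num_views, dim)
w_q, w_k, w_v = (random_matrix(rng, dim, dim) for _ in range(3))

fused = self_attention(view_tokens, w_q, w_k, w_v)

# Feed the same views in a different order.
perm = [2, 0, 3, 1]
fused_perm = self_attention([view_tokens[i] for i in perm], w_q, w_k, w_v)

# Output row i of the permuted run should match row perm[i] of the original run.
equivariant = all(
    math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-9)
    for i, p in enumerate(perm)
    for a, b in zip(fused_perm[i], fused[p])
)
print("order-independent fusion:", equivariant)
```

A full model would stack several such layers with multiple heads, residual connections, and layer normalization, and would decode the fused tokens into a 3D volume; none of that is shown here.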

Taxonomy

Topics: Advanced Vision and Imaging · 3D Shape Modeling and Analysis · Robotics and Sensor-Based Localization

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Byte Pair Encoding · Adam · Dense Connections · Softmax · Layer Normalization · Dropout