MVT: Multi-view Vision Transformer for 3D Object Recognition

Shuo Chen; Tan Yu; Ping Li

arXiv:2110.13083 · cs.CV · October 26, 2021 · 20 citations

TL;DR

This paper introduces MVT, a multi-view vision Transformer that models communication between patches from different views for 3D object recognition. It outperforms CNN-based multi-view methods while carrying less inductive bias.

Contribution

The paper proposes a Multi-view Vision Transformer that captures cross-view patch interactions and uses a global-local structure to balance effectiveness and efficiency in 3D object recognition.

Findings

1. Achieves competitive results on the ModelNet40 and ModelNet10 benchmarks.

2. Outperforms traditional CNN-based multi-view methods.

3. Demonstrates the effectiveness of vision Transformers in 3D recognition tasks.

Abstract

Inspired by the great success achieved by CNNs in image recognition, view-based methods applied CNNs to model the projected views for 3D object understanding and achieved excellent performance. Nevertheless, multi-view CNN models cannot model the communication between patches from different views, which limits their effectiveness in 3D object recognition. Inspired by the recent success of the vision Transformer in image recognition, we propose a Multi-view Vision Transformer (MVT) for 3D object recognition. Since each patch feature in a Transformer block has a global receptive field, it naturally enables communication between patches from different views. Meanwhile, it carries much less inductive bias than its CNN counterparts. Considering both effectiveness and efficiency, we develop a global-local structure for our MVT. Our experiments on two public benchmarks, ModelNet40 and…
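The point about the global receptive field can be illustrated with a small sketch. It is not the authors' implementation: the view count, patch count, feature size, random features, and the use of identity query/key projections are all assumptions made for illustration. It only contrasts self-attention applied within each view with self-attention applied over the patch tokens of all views concatenated, where a patch in one view can place attention weight on patches from other views.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(tokens):
    """Scaled dot-product attention weights.

    Queries and keys are the tokens themselves (identity projections),
    an illustrative simplification rather than a learned layer.
    """
    scale = 1.0 / math.sqrt(len(tokens[0]))
    return [
        softmax([scale * sum(a * b for a, b in zip(q, k)) for k in tokens])
        for q in tokens
    ]


# Hypothetical sizes: 3 views, 4 patches per view, 8-dimensional features.
random.seed(0)
num_views, patches_per_view, dim = 3, 4, 8
views = [
    [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(patches_per_view)]
    for _ in range(num_views)
]

# Per-view attention: each patch only sees patches from its own view.
local_weights = [attention_weights(view) for view in views]

# Attention over all views: concatenate every view's patch tokens first,
# so each patch can attend to patches from every other view.
all_tokens = [token for view in views for token in view]
global_weights = attention_weights(all_tokens)

# Attention mass that patch 0 of view 0 places on patches from other views.
cross_view_mass = sum(global_weights[0][patches_per_view:])
print(f"cross-view attention mass: {cross_view_mass:.3f}")
```

In the per-view case each attention row has only `patches_per_view` entries, so no weight can reach another view; in the concatenated case each row spans `num_views * patches_per_view` entries. How MVT arranges these two kinds of attention in its global-local structure is not specified in the text above, so the sketch does not attempt to reproduce it.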

Taxonomy

Topics: Advanced Neural Network Applications · Robotics and Sensor-Based Localization · Video Surveillance and Tracking Methods

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Dense Connections · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Label Smoothing · Adam · Dropout