GeoMIM: Towards Better 3D Knowledge Transfer via Masked Image Modeling for Multi-view 3D Understanding

Jihao Liu; Tai Wang; Boxiao Liu; Qihang Zhang; Yu Liu; Hongsheng Li

arXiv:2303.11325 · cs.CV · August 29, 2023 · 1 cites

Open Access · 1 Repo

TL;DR

GeoMIM introduces a multi-view vision transformer pretrained with geometry-aware masked image modeling. It transfers knowledge from a pretrained LiDAR model to improve camera-based 3D detection and segmentation, with state-of-the-art results reported on the nuScenes benchmark.

Contribution

The paper proposes GeoMIM, a geometry-enhanced masked image modeling approach with cross-view attention, bridging the domain gap between LiDAR and camera features for better 3D understanding.

Findings

01 Outperforms existing methods on the nuScenes benchmark

02 Achieves state-of-the-art camera-based 3D detection

03 Improves 3D segmentation accuracy

Abstract

Multi-view camera-based 3D detection is a challenging problem in computer vision. Recent works leverage a pretrained LiDAR detection model to transfer knowledge to a camera-based student network. However, we argue that there is a major domain gap between the LiDAR BEV features and the camera-based BEV features, as they have different characteristics and are derived from different sources. In this paper, we propose Geometry Enhanced Masked Image Modeling (GeoMIM) to transfer the knowledge of the LiDAR model in a pretrain-finetune paradigm for improving the multi-view camera-based 3D detection. GeoMIM is a multi-camera vision transformer with Cross-View Attention (CVA) blocks that uses LiDAR BEV features encoded by the pretrained BEV model as learning targets. During pretraining, GeoMIM's decoder has a semantic branch completing dense perspective-view features and the other geometry…
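
As an illustration of the masked image modeling setup the abstract describes, the following minimal Python sketch masks a random subset of patches in each camera view and collects the visible patches from all views into one token sequence, the kind of joint sequence that attention across views could operate on. The function names, the 50% mask ratio, the 16 patches per view, and the uniform random sampling are assumptions made for illustration only; the abstract does not specify them, and this is not the authors' implementation. The 6 views match the nuScenes camera rig.

```python
import random


def random_patch_mask(num_views, patches_per_view, mask_ratio, seed=0):
    """Return one boolean mask per view; True marks a masked (hidden) patch.

    Hypothetical helper: uniform random masking with a fixed ratio is an
    assumption, not a detail taken from the paper.
    """
    rng = random.Random(seed)
    num_masked = int(patches_per_view * mask_ratio)
    masks = []
    for _ in range(num_views):
        # Pick distinct patch indices to hide in this view.
        masked_ids = set(rng.sample(range(patches_per_view), num_masked))
        masks.append([i in masked_ids for i in range(patches_per_view)])
    return masks


def visible_tokens(masks):
    """Flatten the (view, patch) indices of all visible patches into one list.

    Hypothetical helper: treating the visible patches of every view as a
    single sequence is one simple way to let attention span views.
    """
    return [
        (view, patch)
        for view, mask in enumerate(masks)
        for patch, is_masked in enumerate(mask)
        if not is_masked
    ]


# Illustrative sizes only: 6 camera views, 16 patches each, half of them masked.
masks = random_patch_mask(num_views=6, patches_per_view=16, mask_ratio=0.5)
tokens = visible_tokens(masks)
print(len(tokens), "visible tokens across", len(masks), "views")
```

In a real pretraining pipeline the encoder would see only the visible tokens, and the decoder would be trained to predict targets at the masked positions; here the targets themselves (the LiDAR BEV features named in the abstract) are left out of the sketch.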

Code & Models

Repositories

sense-x/geomim
PyTorch · Official

Taxonomy

Topics: Advanced Neural Network Applications · Advanced Vision and Imaging · Robotics and Sensor-Based Localization

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Residual Connection · Layer Normalization · Dense Connections · Vision Transformer