GeoMIM: Towards Better 3D Knowledge Transfer via Masked Image Modeling for Multi-view 3D Understanding
Jihao Liu, Tai Wang, Boxiao Liu, Qihang Zhang, Yu Liu, Hongsheng Li

TL;DR
GeoMIM introduces a novel multi-view transformer with geometry-aware masked image modeling to effectively transfer LiDAR knowledge to improve camera-based 3D detection and segmentation, achieving state-of-the-art results.
Contribution
The paper proposes GeoMIM, a geometry-enhanced masked image modeling approach with cross-view attention, bridging the domain gap between LiDAR and camera features for better 3D understanding.
Findings
Outperforms existing methods on nuScenes benchmark
Achieves state-of-the-art camera-based 3D detection
Improves 3D segmentation accuracy
Abstract
Multi-view camera-based 3D detection is a challenging problem in computer vision. Recent works leverage a pretrained LiDAR detection model to transfer knowledge to a camera-based student network. However, we argue that there is a major domain gap between the LiDAR BEV features and the camera-based BEV features, as they have different characteristics and are derived from different sources. In this paper, we propose Geometry Enhanced Masked Image Modeling (GeoMIM) to transfer the knowledge of the LiDAR model in a pretrain-finetune paradigm for improving the multi-view camera-based 3D detection. GeoMIM is a multi-camera vision transformer with Cross-View Attention (CVA) blocks that uses LiDAR BEV features encoded by the pretrained BEV model as learning targets. During pretraining, GeoMIM's decoder has a semantic branch completing dense perspective-view features and the other geometry…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsAdvanced Neural Network Applications · Advanced Vision and Imaging · Robotics and Sensor-Based Localization
MethodsMulti-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Residual Connection · Layer Normalization · Dense Connections · Vision Transformer
