SimIPU: Simple 2D Image and 3D Point Cloud Unsupervised Pre-Training for Spatial-Aware Visual Representations
Zhenyu Li, Zehui Chen, Ang Li, Liangji Fang, Qinhong Jiang, Xianming Liu, Junjun Jiang, Bolei Zhou, Hang Zhao

TL;DR
SimIPU introduces a novel unsupervised pre-training method that enhances 2D image representations with 3D spatial awareness using multi-modal contrastive learning, improving performance on 3D-related vision tasks.
Contribution
This work is the first to apply contrastive pre-training to outdoor multi-modal datasets that pair images with LiDAR point clouds, with the goal of learning spatial-aware visual representations.
Findings
Effective spatial-aware representations learned from point clouds.
Successful transfer of spatial perception to image encoders.
First contrastive pre-training approach for outdoor multi-modal data.
Abstract
Pre-training has become a standard paradigm in many computer vision tasks. However, most methods are designed for the RGB image domain. Due to the discrepancy between the two-dimensional image plane and three-dimensional space, such pre-trained models fail to perceive spatial information and serve as sub-optimal solutions for 3D-related tasks. To bridge this gap, we aim to learn a spatial-aware visual representation that can describe three-dimensional space and is more suitable and effective for these tasks. To leverage point clouds, which are far superior to images in providing spatial information, we propose a simple yet effective 2D Image and 3D Point cloud Unsupervised pre-training strategy, called SimIPU. Specifically, we develop a multi-modal contrastive learning framework that consists of an intra-modal spatial perception module to learn a…
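To make the idea of contrastive learning across images and point clouds concrete, the following is a minimal sketch of a symmetric InfoNCE-style loss over paired features. It is an illustration under stated assumptions, not the loss used by SimIPU: the abstract above is truncated before it specifies the objective, and the function name `symmetric_info_nce`, the `temperature` value, and the assumption that row i of each feature matrix forms a positive pair are all hypothetical.

```python
import numpy as np


def symmetric_info_nce(img_feats, pts_feats, temperature=0.07):
    """Symmetric InfoNCE loss over paired image and point features.

    Assumption (not from the paper): both inputs have shape (N, D), and
    row i of img_feats is the positive match for row i of pts_feats.
    All other rows in the batch act as negatives.
    """
    # L2-normalize so the dot product is a cosine similarity.
    img = img_feats / np.linalg.norm(img_feats, axis=1, keepdims=True)
    pts = pts_feats / np.linalg.norm(pts_feats, axis=1, keepdims=True)

    # (N, N) similarity matrix; positives lie on the diagonal.
    logits = img @ pts.T / temperature

    def diagonal_cross_entropy(scores):
        # Numerically stable log-softmax over each row.
        shifted = scores - scores.max(axis=1, keepdims=True)
        log_prob = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Average the image-to-point and point-to-image directions.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image_features = rng.normal(size=(8, 16))
    # Nearly aligned point features should give a low loss.
    aligned_points = image_features + 0.01 * rng.normal(size=(8, 16))
    # Unrelated point features should give a higher loss.
    random_points = rng.normal(size=(8, 16))
    print("aligned:", symmetric_info_nce(image_features, aligned_points))
    print("random: ", symmetric_info_nce(image_features, random_points))
```

In a sketch like this, the features would come from an image encoder and a point cloud encoder evaluated at corresponding locations; how such correspondences are established is not described in the text available here.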
Taxonomy
Topics: Robotics and Sensor-Based Localization · Advanced Image and Video Retrieval Techniques · Advanced Vision and Imaging
Methods: Contrastive Learning
