PiMAE: Point Cloud and Image Interactive Masked Autoencoders for 3D Object Detection
Anthony Chen, Kevin Zhang, Renrui Zhang, Zihan Wang, Yuheng Lu, Yandong Guo, Shanghang Zhang

TL;DR
PiMAE is a self-supervised pre-training framework that improves 3D object detection by promoting interaction between the point cloud and RGB image modalities through three mechanisms: a cross-modal masking strategy, a two-branch pipeline with aligned mask and visible tokens, and a shared decoder.
Contribution
The paper introduces PiMAE, a multi-modality masked autoencoder framework that fosters cross-modal interaction during pre-training and improves both 3D and 2D detection performance (by 2.9% and 6.7%, respectively).
Findings
Improves 3D detectors by 2.9%
Enhances 2D detectors by 6.7%
Boosts few-shot classifiers by 2.4%
Abstract
Masked Autoencoders learn strong visual representations and achieve state-of-the-art results in several independent modalities, yet very few works have addressed their capabilities in multi-modality settings. In this work, we focus on point cloud and RGB image data, two modalities that are often presented together in the real world, and explore their meaningful interactions. To improve upon the cross-modal synergy in existing works, we propose PiMAE, a self-supervised pre-training framework that promotes 3D and 2D interaction through three aspects. Specifically, we first notice the importance of masking strategies between the two sources and utilize a projection module to complementarily align the mask and visible tokens of the two modalities. Then, we utilize a well-crafted two-branch MAE pipeline with a novel shared decoder to promote cross-modality interaction in the mask tokens.…
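The masking idea in the abstract (using a projection module to align the mask and visible tokens of the two modalities in a complementary way) can be illustrated with a rough sketch. This is not the authors' implementation: the pinhole camera model, the rule "image patches hit by visible point tokens are masked", and all names (`project_to_patch`, `complementary_masks`, `focal`, `cx`, `cy`, `patch`) are assumptions made for illustration. The sketch also ignores keeping a fixed image mask ratio.

```python
import random


def project_to_patch(point, focal, cx, cy, patch, grid_w, grid_h):
    """Project a 3D point (camera coordinates) to an image patch index.

    Uses a simple pinhole model (an assumption). Returns None when the
    point is behind the camera or falls outside the patch grid.
    """
    x, y, z = point
    if z <= 0:
        return None
    u = focal * x / z + cx
    v = focal * y / z + cy
    col, row = int(u // patch), int(v // patch)
    if 0 <= col < grid_w and 0 <= row < grid_h:
        return row * grid_w + col
    return None


def complementary_masks(token_centers, mask_ratio, focal, cx, cy,
                        patch, grid_w, grid_h, seed=0):
    """Randomly mask point tokens, then derive a complementary image mask.

    Assumed rule: an image patch is masked if some visible point token
    projects into it, so each modality tends to see what the other lacks.
    """
    rng = random.Random(seed)
    n = len(token_centers)
    n_masked = int(round(mask_ratio * n))
    masked_pts = set(rng.sample(range(n), n_masked))
    visible_pts = set(range(n)) - masked_pts

    image_masked = set()
    for i in visible_pts:
        p = project_to_patch(token_centers[i], focal, cx, cy,
                             patch, grid_w, grid_h)
        if p is not None:
            image_masked.add(p)
    image_visible = set(range(grid_w * grid_h)) - image_masked
    return masked_pts, visible_pts, image_masked, image_visible


# Hypothetical usage: four point-token centers, a 224x224 image, 16x16 patches.
centers = [(0.0, 0.0, 1.0), (1.0, 0.0, 2.0), (-1.0, -1.0, 2.0), (0.5, 0.5, 1.0)]
masked_pts, visible_pts, image_masked, image_visible = complementary_masks(
    centers, mask_ratio=0.5, focal=100.0, cx=112.0, cy=112.0,
    patch=16, grid_w=14, grid_h=14)
```

In this toy setup every image patch that a visible point token lands in is hidden from the image encoder, which is one plausible reading of "complementary" alignment; the paper itself should be consulted for the actual projection and masking rules.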
Taxonomy
Topics: 3D Surveying and Cultural Heritage · Advanced Neural Network Applications · 3D Shape Modeling and Analysis
Methods: Masked autoencoder · ALIGN
