UniM$^2$AE: Multi-modal Masked Autoencoders with Unified 3D Representation for 3D Perception in Autonomous Driving
Jian Zou, Tianyu Huang, Guanglei Yang, Zhenhua Guo, Tao Luo, Chun-Mei Feng, Wangmeng Zuo

TL;DR
UniM$^2$AE introduces a multi-modal masked autoencoder framework that unifies image and LiDAR data into a 3D volume space, improving 3D perception tasks for autonomous driving through efficient multi-modal fusion.
Contribution
The paper proposes a novel multi-modal autoencoder with a unified 3D representation and an interactive module, enhancing multi-modal feature integration for autonomous driving perception tasks.
Findings
Improves 3D object detection by 1.2% NDS
Enhances BEV map segmentation by 6.5% mIoU
Demonstrates effective multi-modal fusion in autonomous driving
Abstract
Masked Autoencoders (MAE) play a pivotal role in learning potent representations, delivering outstanding results across various 3D perception tasks essential for autonomous driving. In real-world driving scenarios, it's commonplace to deploy multiple sensors for comprehensive environment perception. Although integrating multi-modal features from these sensors can produce rich and powerful features, there is a noticeable challenge in MAE methods addressing this integration due to the substantial disparity between the different modalities. This research delves into multi-modal Masked Autoencoders tailored for a unified representation space in autonomous driving, aiming to pioneer a more efficient fusion of two distinct modalities. To intricately marry the semantics inherent in images with the geometric intricacies of LiDAR point clouds, we propose UniM$^2$AE. This model stands as a potent…
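For readers unfamiliar with the masking step that MAE-style pre-training relies on, the following is a minimal sketch of random token masking applied independently to two modalities. It illustrates the general MAE idea only and is not the authors' implementation: the function name `random_mask`, the token counts, the feature width, and the mask ratios are all assumptions chosen for illustration.

```python
import numpy as np

def random_mask(tokens, mask_ratio, rng):
    """Split tokens into a visible subset and a boolean mask (True = masked)."""
    n = tokens.shape[0]
    n_keep = int(round(n * (1.0 - mask_ratio)))
    perm = rng.permutation(n)
    keep_idx = np.sort(perm[:n_keep])      # indices of tokens the encoder sees
    mask = np.ones(n, dtype=bool)
    mask[keep_idx] = False                 # everything else must be reconstructed
    return tokens[keep_idx], mask

rng = np.random.default_rng(0)
# Hypothetical inputs: 196 image patch tokens and 400 LiDAR voxel tokens,
# each with a 32-dimensional feature (sizes are assumptions, not from the paper).
image_tokens = rng.normal(size=(196, 32))
lidar_tokens = rng.normal(size=(400, 32))

img_visible, img_mask = random_mask(image_tokens, 0.75, rng)
lidar_visible, lidar_mask = random_mask(lidar_tokens, 0.70, rng)
```

In an MAE pipeline, only the visible tokens are passed to the encoder, and a decoder is trained to reconstruct the content at the masked positions.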
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Robotics and Sensor-Based Localization
Methods: Masked autoencoder
