AutoAlign: Pixel-Instance Feature Aggregation for Multi-Modal 3D Object Detection
Zehui Chen, Zhenyu Li, Shiquan Zhang, Liangji Fang, Qinghong Jiang, Feng Zhao, Bolei Zhou, Hang Zhao

TL;DR
AutoAlign introduces a learnable, data-driven feature fusion method for multi-modal 3D object detection, significantly improving detection accuracy by adaptively aligning image and LiDAR features.
Contribution
The paper proposes a novel automatic feature alignment strategy using a learnable map and cross-attention modules for enhanced multi-modal 3D detection.
Findings
Improves detection accuracy by 2.3 mAP on the KITTI dataset
Improves detection accuracy by 7.0 mAP on the nuScenes dataset
Reaches 70.9 NDS on the nuScenes leaderboard
Abstract
Object detection through either RGB images or LiDAR point clouds has been extensively explored in autonomous driving. However, it remains challenging to make these two data sources complementary and beneficial to each other. In this paper, we propose AutoAlign, an automatic feature fusion strategy for 3D object detection. Instead of establishing deterministic correspondence with the camera projection matrix, we model the mapping relationship between the image and point clouds with a learnable alignment map. This map enables our model to automate the alignment of non-homogeneous features in a dynamic and data-driven manner. Specifically, a cross-attention feature alignment module is devised to adaptively aggregate pixel-level image features for each voxel. To enhance the semantic consistency during feature alignment, we also design a self-supervised cross-modal feature…
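The cross-attention step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it is plain single-head scaled dot-product attention in NumPy, with voxel features as queries and flattened image features as keys and values. The function name `cross_attention_align`, the projection weights `w_q`, `w_k`, `w_v`, and all shapes are assumptions chosen for illustration, and the weights are random here rather than learned.

```python
import numpy as np


def cross_attention_align(voxel_feats, pixel_feats, w_q, w_k, w_v):
    """Aggregate pixel features for each voxel with single-head cross-attention.

    voxel_feats: (N, C_v) array, one feature vector per voxel (queries).
    pixel_feats: (M, C_p) array, flattened image feature map (keys/values).
    w_q, w_k, w_v: hypothetical projection weights of shapes
        (C_v, D), (C_p, D), (C_p, D).
    Returns (aligned, attn) with shapes (N, D) and (N, M).
    """
    q = voxel_feats @ w_q  # (N, D) voxel queries
    k = pixel_feats @ w_k  # (M, D) pixel keys
    v = pixel_feats @ w_v  # (M, D) pixel values

    # Scaled dot-product scores between every voxel and every pixel.
    scores = q @ k.T / np.sqrt(q.shape[-1])  # (N, M)

    # Numerically stable softmax over the pixel axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=-1, keepdims=True)

    # Each voxel receives a weighted sum of pixel values.
    return attn @ v, attn


# Toy sizes (assumed): 4 voxels, 6 pixels, feature dims 8 and 16.
rng = np.random.default_rng(0)
N, M, C_v, C_p, D = 4, 6, 8, 16, 8
voxel_feats = rng.normal(size=(N, C_v))
pixel_feats = rng.normal(size=(M, C_p))
w_q = rng.normal(size=(C_v, D))
w_k = rng.normal(size=(C_p, D))
w_v = rng.normal(size=(C_p, D))

aligned, attn = cross_attention_align(voxel_feats, pixel_feats, w_q, w_k, w_v)
print(aligned.shape, attn.shape)
```

In this sketch the `(N, M)` matrix `attn` loosely plays the role of the alignment map: each row is a distribution over pixels for one voxel, computed from the features rather than from a fixed camera projection.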
Taxonomy
Topics: Advanced Neural Network Applications · Robotics and Sensor-Based Localization · Visual Attention and Saliency Detection
