Progressive Multi-Modal Fusion for Robust 3D Object Detection
Rohit Mohan, Daniele Cattaneo, Florian Drews, Abhinav Valada

TL;DR
ProFusion3D is a multi-modal fusion framework that hierarchically combines camera and LiDAR features in both Bird's Eye View (BEV) and Perspective View (PV), improving the robustness and data efficiency of 3D object detection.
Contribution
It introduces a progressive fusion architecture with hierarchical feature integration and a self-supervised pre-training strategy for enhanced multi-modal learning.
Findings
Outperforms existing methods on the nuScenes and Argoverse2 datasets.
Maintains strong detection performance under sensor failure scenarios.
Improves data efficiency through three novel self-supervised pre-training objectives.
Abstract
Multi-sensor fusion is crucial for accurate 3D object detection in autonomous driving, with cameras and LiDAR being the most commonly used sensors. However, existing methods perform sensor fusion in a single view by projecting features from both modalities into either Bird's Eye View (BEV) or Perspective View (PV), thus sacrificing complementary information such as height or geometric proportions. To address this limitation, we propose ProFusion3D, a progressive fusion framework that combines features in both BEV and PV at both the intermediate and object query levels. Our architecture hierarchically fuses local and global features, enhancing the robustness of 3D object detection. Additionally, we introduce a self-supervised mask modeling pre-training strategy to improve multi-modal representation learning and data efficiency through three novel objectives. Extensive experiments on nuScenes…
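To make the single-view limitation concrete, the following is a minimal sketch, not taken from the paper, of how the same 3D points land in a BEV grid versus a PV image. The function names, grid extent, cell size, and camera intrinsics are all hypothetical values chosen for illustration, and a single ego-style frame (x forward, y left, z up) with an idealized forward-facing pinhole camera is assumed.

```python
import numpy as np

def to_bev_cell(points, x_min=-50.0, y_min=-50.0, cell=0.5):
    """Map (N, 3) points to integer BEV grid indices.

    Hypothetical grid: origin at (x_min, y_min), square cells of size `cell`.
    The z coordinate (height) is dropped entirely.
    """
    ix = np.floor((points[:, 0] - x_min) / cell).astype(int)
    iy = np.floor((points[:, 1] - y_min) / cell).astype(int)
    return np.stack([ix, iy], axis=1)

def to_pv_pixel(points, fx=1000.0, fy=1000.0, cx=800.0, cy=450.0):
    """Project (N, 3) points to pixel coordinates with a pinhole model.

    Assumes the camera looks along +x, so x acts as depth; the
    intrinsics (fx, fy, cx, cy) are made-up illustration values.
    Depth is divided out, so points on one ray share a pixel.
    """
    depth = points[:, 0]
    u = cx - fx * points[:, 1] / depth
    v = cy - fy * points[:, 2] / depth
    return np.stack([u, v], axis=1)

# Two points stacked vertically: same (x, y), different height.
stacked = np.array([[10.0, 2.0, 0.0],
                    [10.0, 2.0, 1.5]])

# Two points on the same camera ray: different depth, same direction.
same_ray = np.array([[10.0, 2.0, 1.0],
                     [20.0, 4.0, 2.0]])

bev_stacked = to_bev_cell(stacked)    # identical cells: height is lost
pv_stacked = to_pv_pixel(stacked)     # distinct pixels: height is visible
bev_ray = to_bev_cell(same_ray)       # distinct cells: depth is visible
pv_ray = to_pv_pixel(same_ray)        # identical pixels: depth is lost
```

In this toy setup the vertically stacked points collapse into one BEV cell but stay separate in PV, while the points on one ray collapse into one pixel but stay separate in BEV. This is the kind of complementary information that fusing in a single view gives up.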
Peer Reviews
Decision·CoRL 2024
Taxonomy
Topics: Industrial Vision Systems and Defect Detection · Advanced Neural Network Applications · Image and Object Detection Techniques
