TL;DR
MambaFusion introduces a height-fidelity dense global fusion method using a novel Mamba block, achieving state-of-the-art multi-modal 3D object detection performance while maintaining efficiency and preserving scene height information.
Contribution
The paper proposes a new height-fidelity LiDAR encoding and a Hybrid Mamba Block that together fuse complete scene information efficiently and over long ranges for multi-modal 3D detection.
Findings
Achieves 75.0 NDS on the nuScenes benchmark.
Outperforms methods that rely on high-resolution inputs.
Runs inference faster than recent state-of-the-art methods.
Abstract
We present the first work demonstrating that a pure Mamba block can achieve efficient Dense Global Fusion while maintaining top performance for camera-LiDAR multi-modal 3D object detection. Our motivation stems from the observation that existing fusion strategies cannot simultaneously achieve efficiency, long-range modeling, and retention of complete scene information. Inspired by recent advances in state-space models (SSMs) and linear attention, we leverage their linear complexity and long-range modeling capabilities to address these challenges. However, this is non-trivial: our experiments reveal that simply adopting efficient linear-complexity methods does not necessarily yield improvements and may even degrade performance. We attribute this degradation to the loss of height information during multi-modal alignment, leading to deviations in…
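The linear-complexity, long-range property that the abstract attributes to SSMs can be illustrated with a minimal sketch. This is a generic, time-invariant diagonal state-space recurrence in plain Python, not the paper's Hybrid Mamba Block and not Mamba's selective (input-dependent) scan; the function name `ssm_scan` and the parameters `a`, `b`, `c` are assumptions made for illustration only.

```python
def ssm_scan(xs, a, b, c):
    """Run a diagonal linear state-space recurrence over a 1-D sequence.

    Hypothetical illustration (not from the paper):
        h_t = a * h_{t-1} + b * x_t    (elementwise over state channels)
        y_t = sum(c * h_t)
    """
    if not (len(a) == len(b) == len(c)):
        raise ValueError("a, b and c must have the same length")

    h = [0.0] * len(a)  # hidden state, one value per state channel
    ys = []
    for x in xs:
        # One fixed-size state update per token: total cost grows
        # linearly with len(xs), unlike quadratic pairwise attention.
        h = [ai * hi + bi * x for ai, hi, bi in zip(a, h, b)]
        ys.append(sum(ci * hi for ci, hi in zip(c, h)))
    return ys


if __name__ == "__main__":
    # An impulse at the first position keeps influencing later outputs,
    # decaying by the factor a at each step.
    print(ssm_scan([1.0, 0.0, 0.0], a=[0.5], b=[1.0], c=[1.0]))
```

With a single state channel and `a = 0.5`, the impulse input above yields `[1.0, 0.5, 0.25]`: every later output still depends on the first token through the carried state, which is the sense in which such recurrences model long ranges at linear cost.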
Taxonomy
Methods: Mamba: Linear-Time Sequence Modeling with Selective State Spaces · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
