MambaFusion: Height-Fidelity Dense Global Fusion for Multi-modal 3D Object Detection

Hanshi Wang; Jin Gao; Weiming Hu; Zhipeng Zhang

arXiv:2507.04369·cs.CV·July 8, 2025

TL;DR

MambaFusion introduces a height-fidelity dense global fusion method using a novel Mamba block, achieving state-of-the-art multi-modal 3D object detection performance while maintaining efficiency and preserving scene height information.

Contribution

The paper proposes a new height-fidelity LiDAR encoding and Hybrid Mamba Block for efficient, long-range, and complete scene information fusion in multi-modal 3D detection.

Findings

01. Achieves a nuScenes Detection Score (NDS) of 75.0 on the nuScenes benchmark.

02. Outperforms methods that rely on high-resolution inputs.

03. Runs faster at inference than recent state-of-the-art methods.
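For readers unfamiliar with the metric in the first finding, NDS is the composite score defined by the nuScenes benchmark: a weighted sum of mean average precision (mAP) and five mean true-positive error terms (translation, scale, orientation, velocity, attribute), each clipped to 1. The sketch below shows that standard definition only. The function name and the sample inputs are illustrative assumptions and are not values reported by the paper.

```python
from typing import Sequence


def nuscenes_detection_score(mean_ap: float, tp_errors: Sequence[float]) -> float:
    """Compute NDS from mAP and the five mean true-positive errors.

    mean_ap   : mAP as a fraction in [0, 1].
    tp_errors : (mATE, mASE, mAOE, mAVE, mAAE); each error is clipped to 1.
    Returns NDS as a fraction in [0, 1]; leaderboards usually show it x100.
    """
    if len(tp_errors) != 5:
        raise ValueError("expected exactly five true-positive error terms")
    # Convert each error into a score: lower error gives a higher score.
    tp_scores = [1.0 - min(1.0, err) for err in tp_errors]
    # mAP carries weight 5, each TP score carries weight 1, out of 10.
    return (5.0 * mean_ap + sum(tp_scores)) / 10.0


# Hypothetical inputs, chosen only to illustrate the arithmetic.
example_nds = nuscenes_detection_score(0.5, [0.5, 0.5, 0.5, 0.5, 0.5])
```

Because mAP takes half of the total weight, a detector can only reach a high NDS by combining strong detection accuracy with low localization, scale, orientation, velocity, and attribute errors.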

Abstract

We present the first work demonstrating that a pure Mamba block can achieve efficient Dense Global Fusion while delivering top performance for camera-LiDAR multi-modal 3D object detection. Our motivation stems from the observation that existing fusion strategies cannot simultaneously achieve efficiency, long-range modeling, and retention of complete scene information. Inspired by recent advances in state-space models (SSMs) and linear attention, we leverage their linear complexity and long-range modeling capabilities to address these challenges. However, this is non-trivial: our experiments reveal that simply adopting efficient linear-complexity methods does not necessarily yield improvements and may even degrade performance. We attribute this degradation to the loss of height information during multi-modal alignment, leading to deviations in…
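The linear complexity that the abstract attributes to state-space models can be illustrated with a minimal recurrence. The sketch below assumes a time-invariant SSM with a diagonal state matrix and a scalar input channel. It is not Mamba's selective (input-dependent) scan and not the paper's Hybrid Mamba Block; the function name `ssm_scan` and all parameter values are hypothetical.

```python
from typing import List, Sequence


def ssm_scan(
    x: Sequence[float],
    a: Sequence[float],
    b: Sequence[float],
    c: Sequence[float],
) -> List[float]:
    """Run a diagonal linear state-space recurrence over a 1-D sequence.

    h_t = a * h_{t-1} + b * x_t   (elementwise over the state dimension)
    y_t = sum(c * h_t)

    One pass over the sequence, so the cost is O(len(x) * len(a)):
    linear in sequence length, unlike the quadratic cost of full attention.
    """
    h = [0.0] * len(a)  # hidden state, initialised to zero
    ys: List[float] = []
    for x_t in x:
        # The state summarises all earlier tokens, so each step can
        # carry long-range context without revisiting the past.
        h = [a_i * h_i + b_i * x_t for a_i, h_i, b_i in zip(a, h, b)]
        ys.append(sum(c_i * h_i for c_i, h_i in zip(c, h)))
    return ys


# An impulse at the first position decays geometrically through the state.
impulse_response = ssm_scan([1.0, 0.0, 0.0], a=[0.5], b=[1.0], c=[2.0])
```

The first token still influences the output at every later position, which is the long-range behaviour the abstract refers to, while the work per token stays constant.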

Code & Models

Repositories

AutoLab-SAI-SJTU/MambaFusion (PyTorch, official)

Taxonomy

Methods: Mamba: Linear-Time Sequence Modeling with Selective State Spaces · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings