MM-DETR: An Efficient Multimodal Detection Transformer with Mamba-Driven Dual-Granularity Fusion and Frequency-Aware Modality Adapters
Jianhong Han, Yupei Wang, Yuan Zhang, Liang Chen

TL;DR
MM-DETR introduces a lightweight, efficient multimodal detection framework that employs a novel dual-granularity fusion encoder and frequency-aware adapters to enhance feature extraction and modality-specific modeling with minimal overhead.
Contribution
The paper presents a Mamba-based dual-granularity fusion encoder and frequency-aware modality adapters, which together improve multimodal detection efficiency and performance over existing methods.
Findings
Outperforms existing methods on four benchmark datasets.
Achieves a good balance between performance and model complexity.
Demonstrates strong generalization capability across diverse datasets.
Abstract
Multimodal remote sensing object detection aims to achieve more accurate and robust perception under challenging conditions by fusing complementary information from different modalities. However, existing approaches that rely on attention-based or deformable convolution fusion blocks still struggle to balance performance and lightweight design. Beyond fusion complexity, extracting modality features with shared backbones yields suboptimal representations due to insufficient modality-specific modeling, whereas dual-stream architectures nearly double the parameter count, ultimately limiting practical deployment. To this end, we propose MM-DETR, a lightweight and efficient framework for multimodal object detection. Specifically, we propose a Mamba-based dual granularity fusion encoder that reformulates global interaction as channel-wise dynamic gating and leverages a 1D selective scan for…
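The truncated abstract above does not give the gating formula, so the following is only a generic sketch of what "channel-wise dynamic gating" between two modality feature maps can look like, not the authors' implementation. The function name `channel_gate_fuse`, the inputs `feat_a` and `feat_b`, and the per-channel `weights` and `bias` parameters are all hypothetical. The 1D selective scan is not modeled here. Plain Python lists are used to keep the example dependency-free.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def channel_gate_fuse(feat_a, feat_b, weights, bias):
    """Fuse two feature maps using one scalar gate per channel.

    feat_a, feat_b: lists of C channels, each a flat list of spatial values.
    weights: list of C (w_a, w_b) pairs; bias: list of C floats.
    All parameters are illustrative assumptions, not taken from the paper.
    """
    fused, gates = [], []
    for c, (ch_a, ch_b) in enumerate(zip(feat_a, feat_b)):
        # Global average pooling summarizes each channel as one number.
        mean_a = sum(ch_a) / len(ch_a)
        mean_b = sum(ch_b) / len(ch_b)
        # The gate depends on the input statistics, hence "dynamic".
        w_a, w_b = weights[c]
        g = sigmoid(w_a * mean_a + w_b * mean_b + bias[c])
        gates.append(g)
        # Convex combination of the two modalities for this channel.
        fused.append([g * a + (1.0 - g) * b for a, b in zip(ch_a, ch_b)])
    return fused, gates


# Two channels, two spatial positions each (toy values).
feat_a = [[1.0, 3.0], [0.0, 0.0]]
feat_b = [[5.0, 7.0], [2.0, 4.0]]
weights = [(0.0, 0.0), (1.0, -1.0)]
bias = [0.0, 0.0]
fused, gates = channel_gate_fuse(feat_a, feat_b, weights, bias)
```

With zero weights and bias, channel 0 gets a gate of 0.5 and becomes a plain average of the two inputs. In channel 1 the second modality has the larger mean and a negative weight, so the gate drops below 0.5 and the fused channel leans toward `feat_b`. The cost is linear in the number of feature values, which is the kind of saving over pairwise attention that the abstract alludes to.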
Taxonomy
Topics: Advanced Neural Network Applications · Advanced Image Fusion Techniques · Remote-Sensing Image Classification
