MambaNeXt-YOLO: A Hybrid State Space Model for Real-time Object Detection
Xiaochun Lei, Siqi Wu, Weilin Wu, Zetao Jiang

TL;DR
MambaNeXt-YOLO introduces a hybrid model combining CNNs and linear state space models to improve real-time object detection efficiency and accuracy, especially on edge devices, by integrating novel architectural components.
Contribution
It presents a new hybrid architecture with MambaNeXt Block and MAFPN, balancing accuracy and efficiency for real-time detection on resource-limited devices.
Findings
Achieved 66.6% mAP at 31.9 FPS on PASCAL VOC
Supports deployment on NVIDIA Jetson edge devices
Outperforms some existing models in speed and accuracy
Abstract
Real-time object detection is a fundamental but challenging task in computer vision, particularly when computational resources are limited. Although YOLO-series models have set strong benchmarks by balancing speed and accuracy, the increasing need for richer global context modeling has led to the use of Transformer-based architectures. Nevertheless, Transformers have high computational complexity because of their self-attention mechanism, which limits their practicality for real-time and edge deployments. To overcome these challenges, recent developments in linear state space models, such as Mamba, provide a promising alternative by enabling efficient sequence modeling with linear complexity. Building on this insight, we propose MambaNeXt-YOLO, a novel object detection framework that balances accuracy and efficiency through three key contributions: (1) MambaNeXt Block: a hybrid design…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsAdvanced Neural Network Applications · Robotics and Automated Systems · Advanced Image and Video Retrieval Techniques
MethodsSPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Mamba: Linear-Time Sequence Modeling with Selective State Spaces · Sparse Evolutionary Training
