MambaNeXt-YOLO: A Hybrid State Space Model for Real-time Object Detection

Xiaochun Lei; Siqi Wu; Weilin Wu; Zetao Jiang

arXiv:2506.03654·cs.CV·July 25, 2025

TL;DR

MambaNeXt-YOLO is a hybrid detector that combines CNNs with linear state space models to improve the efficiency and accuracy of real-time object detection, particularly on edge devices.

Contribution

The paper presents a hybrid architecture built around the MambaNeXt Block and MAFPN, aiming to balance accuracy and efficiency for real-time detection on resource-limited devices.

Findings

01. Achieved 66.6% mAP at 31.9 FPS on PASCAL VOC

02. Supports deployment on NVIDIA Jetson edge devices

03. Outperforms some existing models in speed and accuracy

Abstract

Real-time object detection is a fundamental but challenging task in computer vision, particularly when computational resources are limited. Although YOLO-series models have set strong benchmarks by balancing speed and accuracy, the increasing need for richer global context modeling has led to the use of Transformer-based architectures. Nevertheless, Transformers have high computational complexity because of their self-attention mechanism, which limits their practicality for real-time and edge deployments. To overcome these challenges, recent developments in linear state space models, such as Mamba, provide a promising alternative by enabling efficient sequence modeling with linear complexity. Building on this insight, we propose MambaNeXt-YOLO, a novel object detection framework that balances accuracy and efficiency through three key contributions: (1) MambaNeXt Block: a hybrid design…
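The linear-complexity claim in the abstract can be illustrated with a minimal sketch. This is a plain, time-invariant diagonal state space recurrence, not Mamba's selective scan and not the paper's MambaNeXt Block; the function names `ssm_scan` and `attention_pair_count` and all parameter values are hypothetical, chosen only for illustration. The scan makes one pass over the sequence with a fixed-size state, while self-attention scores every pair of positions.

```python
def ssm_scan(x, a, b, c):
    """Run a diagonal linear state space recurrence over a 1-D sequence.

    h_t = a * h_{t-1} + b * x_t   (elementwise over the N state entries)
    y_t = sum(c * h_t)

    Cost is O(L * N) for sequence length L and state size N.
    Illustrative only: a, b, c are fixed here, whereas Mamba makes
    its parameters input-dependent.
    """
    h = [0.0] * len(a)  # hidden state, size N, reused at every step
    ys = []
    for x_t in x:  # single pass over the sequence
        h = [a_i * h_i + b_i * x_t for a_i, h_i, b_i in zip(a, h, b)]
        ys.append(sum(c_i * h_i for c_i, h_i in zip(c, h)))
    return ys


def attention_pair_count(seq_len):
    """Number of query-key scores full self-attention computes: L * L."""
    return seq_len * seq_len


if __name__ == "__main__":
    outputs = ssm_scan([1.0] * 4, a=[0.5, 0.0], b=[1.0, 1.0], c=[1.0, 1.0])
    print(outputs)
    print(attention_pair_count(1000))
```

Doubling the sequence length doubles the work done by `ssm_scan` but quadruples the count returned by `attention_pair_count`, which is the scaling gap the abstract refers to.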

Taxonomy

Topics: Advanced Neural Network Applications · Robotics and Automated Systems · Advanced Image and Video Retrieval Techniques

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Mamba: Linear-Time Sequence Modeling with Selective State Spaces · Sparse Evolutionary Training