H-MBA: Hierarchical MamBa Adaptation for Multi-Modal Video Understanding in Autonomous Driving
Siran Chen, Yuxiao Luo, Yue Ma, Yu Qiao, Yali Wang

TL;DR
This paper introduces H-MBA, a hierarchical adaptation framework for multi-modal video understanding in autonomous driving, improving model generalization to complex spatial-temporal scenes by capturing multi-scale context.
Contribution
The novel H-MBA framework with Context and Query Mamba modules effectively captures multi-granularity temporal context, enhancing multi-modal video understanding in autonomous driving.
Findings
Outperforms previous SOTA with 5.5% mIoU improvement in risk object detection
Effectively captures multi-scale temporal context in complex driving scenes
Demonstrates robustness across various multi-modal video tasks
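For readers unfamiliar with the mIoU metric cited in the first finding, the sketch below shows one common way to compute it for box-level detection: the intersection-over-union of each predicted box with its ground-truth box, averaged over samples. This is an assumption about the general metric, not the paper's evaluation code. The function names and the (x1, y1, x2, y2) box format are hypothetical.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mean_iou(preds, gts):
    """Average IoU over paired predicted and ground-truth boxes."""
    if len(preds) != len(gts) or not preds:
        raise ValueError("preds and gts must be non-empty and equal in length")
    return sum(box_iou(p, g) for p, g in zip(preds, gts)) / len(preds)


if __name__ == "__main__":
    preds = [(0, 0, 2, 2), (0, 0, 1, 1)]
    gts = [(0, 0, 2, 2), (2, 2, 3, 3)]
    print(mean_iou(preds, gts))  # one perfect match, one miss
```

Under this definition, a gain in mIoU means the predicted risk-object boxes overlap the annotated boxes more closely on average.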
Abstract
With the prevalence of Multimodal Large Language Models (MLLMs), autonomous driving has encountered new opportunities and challenges. In particular, multi-modal video understanding is critical for interactively analyzing what will happen during autonomous driving. However, videos of such dynamic scenes often contain complex spatial-temporal movements, which restricts the generalization capacity of existing MLLMs in this field. To bridge the gap, we propose a novel Hierarchical Mamba Adaptation (H-MBA) framework to fit the complicated motion changes in autonomous driving videos. Specifically, H-MBA consists of two distinct modules: Context Mamba (C-Mamba) and Query Mamba (Q-Mamba). First, C-Mamba contains various types of structured state space models, which can effectively capture multi-granularity video context for different temporal resolutions.…
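To make the idea of capturing context at different temporal resolutions concrete, here is a minimal sketch. It is not the authors' C-Mamba: it uses a plain, non-selective linear state space recurrence (h_t = a*h_{t-1} + b*x_t, y_t = c*h_t) rather than Mamba's input-dependent selective scan. The scalar per-frame features, the strides, the averaging fusion, and all names and parameter values are illustrative assumptions.

```python
def ssm_scan(xs, a=0.9, b=0.1, c=1.0):
    """Run a scalar linear state space recurrence over a sequence.

    h_t = a * h_{t-1} + b * x_t,  y_t = c * h_t  (fixed, hypothetical parameters).
    """
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x
        ys.append(c * h)
    return ys


def multi_scale_context(xs, strides=(1, 2, 4)):
    """Scan the sequence at several temporal strides and average the results.

    Each stride subsamples the frames, scans the coarser sequence, then maps
    every original time step t back to the coarse output at index t // stride.
    """
    total = len(xs)
    out = [0.0] * total
    for s in strides:
        coarse = ssm_scan(xs[::s])  # lower temporal resolution
        for t in range(total):
            out[t] += coarse[t // s] / len(strides)
    return out


if __name__ == "__main__":
    frame_features = [0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0]  # toy per-frame scalars
    print(multi_scale_context(frame_features))
```

Larger strides summarize slower changes across the clip, while stride 1 keeps frame-level detail. Averaging is only one simple way to combine the scales.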
Taxonomy
Topics: Video Surveillance and Tracking Methods · Human Pose and Action Recognition · Generative Adversarial Networks and Image Synthesis
Methods: Mamba: Linear-Time Sequence Modeling with Selective State Spaces
