H-MBA: Hierarchical MamBa Adaptation for Multi-Modal Video Understanding in Autonomous Driving

Siran Chen; Yuxiao Luo; Yue Ma; Yu Qiao; Yali Wang

arXiv:2501.04302 · cs.CV · January 9, 2025

TL;DR

This paper introduces H-MBA, a hierarchical adaptation framework for multi-modal video understanding in autonomous driving, improving model generalization to complex spatial-temporal scenes by capturing multi-scale context.

Contribution

The novel H-MBA framework with Context and Query Mamba modules effectively captures multi-granularity temporal context, enhancing multi-modal video understanding in autonomous driving.

Findings

1. Outperforms previous SOTA with 5.5% mIoU improvement in risk object detection

2. Effectively captures multi-scale temporal context in complex driving scenes

3. Demonstrates robustness across various multi-modal video tasks

Abstract

With the prevalence of Multimodal Large Language Models (MLLMs), autonomous driving has encountered new opportunities and challenges. In particular, multi-modal video understanding is critical for interactively analyzing what will happen during autonomous driving. However, videos of such dynamic scenes often contain complex spatial-temporal movements, which restricts the generalization capacity of existing MLLMs in this field. To bridge the gap, we propose a novel Hierarchical Mamba Adaptation (H-MBA) framework to fit the complicated motion changes in autonomous driving videos. Specifically, our H-MBA consists of two distinct modules: Context Mamba (C-Mamba) and Query Mamba (Q-Mamba). First, C-Mamba contains various types of structured state space models, which can effectively capture multi-granularity video context for different temporal resolutions.…
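As a rough illustration of the idea of running state space models over a video at several temporal resolutions, the sketch below applies a scalar linear state space recurrence to a sequence of per-frame values sampled at different strides. It is not the paper's C-Mamba: the function names (`ssm_scan`, `multi_resolution_context`), the fixed parameters `a`, `b`, `c`, and the strides `(1, 2, 4)` are all assumptions made for illustration. Mamba's selective state space layers make their parameters input-dependent, which this fixed-parameter toy does not attempt.

```python
from typing import Dict, List, Sequence


def ssm_scan(xs: Sequence[float], a: float = 0.9, b: float = 0.1, c: float = 1.0) -> List[float]:
    """Scalar linear state space recurrence (illustrative, fixed parameters).

    h_t = a * h_{t-1} + b * x_t
    y_t = c * h_t
    """
    h = 0.0
    ys: List[float] = []
    for x in xs:
        h = a * h + b * x  # update the hidden state with the current input
        ys.append(c * h)   # read out the state
    return ys


def multi_resolution_context(
    frames: Sequence[float], strides: Sequence[int] = (1, 2, 4)
) -> Dict[int, List[float]]:
    """Run the same recurrence over the frame sequence at several temporal strides.

    Larger strides see fewer frames, so each step spans a longer time interval.
    The strides are hypothetical; the paper's actual resolutions are not given here.
    """
    return {s: ssm_scan(frames[::s]) for s in strides}


if __name__ == "__main__":
    frames = [1.0] * 8  # stand-in for one feature channel across 8 frames
    context = multi_resolution_context(frames)
    for stride, ys in context.items():
        print(stride, len(ys))
```

Each stride yields its own output sequence, so a downstream module would still need to decide how to combine them; that step is left out because the truncated abstract does not describe it.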

Taxonomy

Topics: Video Surveillance and Tracking Methods · Human Pose and Action Recognition · Generative Adversarial Networks and Image Synthesis

Methods: Mamba: Linear-Time Sequence Modeling with Selective State Spaces