Coupled Mamba: Enhanced Multi-modal Fusion with Coupled State Space Model

Wenbing Li; Hang Zhou; Junqing Yu; Zikai Song; Wei Yang

arXiv:2405.18014·cs.AI·June 19, 2025·5 cites

Open Access

TL;DR

Coupled Mamba introduces a novel multi-modal fusion approach using coupled state space models that better capture modality interactions, leading to improved accuracy and efficiency in multi-domain applications.

Contribution

The paper proposes a coupled state space model that maintains intra-modality independence while effectively modeling inter-modality interactions for enhanced fusion.

Findings

01. Improved F1-Score by up to 2.3% on multiple datasets.
02. 49% faster inference compared to existing methods.
03. 83.7% GPU memory savings during training.

Abstract

The essence of multi-modal fusion lies in exploiting the complementary information inherent in diverse modalities. However, prevalent fusion methods rely on traditional neural architectures and are inadequately equipped to capture the dynamics of interactions across modalities, particularly in the presence of complex intra- and inter-modality correlations. Recent advances in State Space Models (SSMs), notably exemplified by the Mamba model, have emerged as promising contenders. In particular, Mamba's state-evolving process implies a stronger modality fusion paradigm, making multi-modal fusion on SSMs an appealing direction. However, fusing multiple modalities is challenging for SSMs because of their hardware-aware parallelism designs. To this end, this paper proposes the Coupled SSM model, for coupling state chains of multiple modalities while maintaining independence of intra-modality state…
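The idea of coupling per-modality state chains can be illustrated with a toy linear recurrence. This is a minimal sketch under stated assumptions, not the paper's implementation: the function name `coupled_scan`, the coupling matrices `C`, and the fixed (input-independent) parameters are all hypothetical, and the plain sequential loop ignores Mamba's selective parameters, discretization, and hardware-aware parallel scan. Each modality keeps its own state chain, and the only cross-modality influence enters through the other chains' states from the previous time step.

```python
import numpy as np

def coupled_scan(inputs, A, B, C):
    """Toy coupled linear state recurrence over M modalities (illustrative only).

    inputs: list of M arrays, each of shape (T, d_in)
    A:      list of M (d_state, d_state) self-transition matrices
    B:      list of M (d_state, d_in) input matrices
    C:      dict mapping (m, k) -> (d_state, d_state) coupling matrix from
            modality k into modality m (hypothetical name, assumed here)
    Returns a list of M arrays of shape (T, d_state) holding each chain's states.
    """
    M = len(inputs)
    T = inputs[0].shape[0]
    d_state = A[0].shape[0]
    h = [np.zeros(d_state) for _ in range(M)]
    history = [np.zeros((T, d_state)) for _ in range(M)]
    for t in range(T):
        prev = h  # states of all chains at the previous time step
        h = []
        for m in range(M):
            # Intra-modality update: depends only on this chain's own state and input.
            own = A[m] @ prev[m] + B[m] @ inputs[m][t]
            # Inter-modality coupling: previous states of the other chains.
            cross = np.zeros(d_state)
            for k in range(M):
                if k != m:
                    cross = cross + C[(m, k)] @ prev[k]
            h.append(own + cross)
            history[m][t] = h[m]
    return history
```

With all coupling matrices set to zero, the chains evolve independently. With nonzero coupling, the state of one modality at step t reflects the other modality's inputs up to step t-1.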

Taxonomy

Topics: Human Motion and Animation · Speech and Audio Processing · Advanced Vision and Imaging

Methods: Convolution