Deep Mamba Multi-modal Learning

Jian Zhu; Xin Zou; Yu Cui; Zhangmin Huang; Chenshu Hu; Bo Lyu

arXiv:2406.18007·cs.MM·June 27, 2024

Open Access

TL;DR

This paper introduces Deep Mamba Multi-modal Learning (DMML), a novel approach for multi-modal feature fusion, and proposes Deep Mamba Multi-modal Hashing (DMMH) for multimedia retrieval, reporting state-of-the-art results on three public datasets.

Contribution

The paper presents a new deep learning framework inspired by Mamba networks for multi-modal fusion and introduces DMMH, combining accuracy and speed for multimedia retrieval.

Findings

01 DMMH achieves state-of-the-art performance on three datasets.
02 DMML effectively fuses multi-modal features.
03 DMMH balances accuracy and inference speed.

Abstract

Inspired by the strong performance of Mamba networks, we propose Deep Mamba Multi-modal Learning (DMML), a novel method for fusing multi-modal features. We apply DMML to multimedia retrieval and propose Deep Mamba Multi-modal Hashing (DMMH), which combines algorithmic accuracy with fast inference. We validate DMMH on three public datasets, where it achieves state-of-the-art results.
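The abstract does not describe how DMMH is built, so the following is only a generic sketch of why hashing-based retrieval tends to be fast at inference time: fused features are reduced to short binary codes, and ranking becomes a Hamming-distance comparison. Everything here is an assumption made for illustration: the function names (`fuse`, `make_projection`, `hash_code`, `retrieve`) are hypothetical, the fusion step is plain concatenation standing in for DMML's Mamba-based fusion, and the hash function is a fixed random projection rather than a learned one.

```python
import random


def fuse(image_feat, text_feat):
    # Stand-in fusion: plain concatenation of the two modality vectors.
    # This is NOT the Mamba-based fusion proposed in the paper.
    return list(image_feat) + list(text_feat)


def make_projection(in_dim, n_bits, seed=0):
    # Fixed random projection (assumption); a real hashing model would learn it.
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(in_dim)] for _ in range(n_bits)]


def hash_code(fused, projection):
    # One bit per projection row: 1 if the activation is non-negative, else 0.
    # Bits are packed into a single Python int.
    code = 0
    for row in projection:
        activation = sum(w * x for w, x in zip(row, fused))
        code = (code << 1) | (1 if activation >= 0 else 0)
    return code


def hamming(a, b):
    # Number of differing bits between two packed codes.
    return bin(a ^ b).count("1")


def retrieve(query_code, db_codes, top_k=3):
    # Rank database items by Hamming distance to the query code.
    order = sorted(range(len(db_codes)), key=lambda i: hamming(query_code, db_codes[i]))
    return order[:top_k]


# Toy data: 5 items, each with a 4-d "image" and a 4-d "text" feature vector.
rng = random.Random(42)
items = [([rng.random() for _ in range(4)], [rng.random() for _ in range(4)])
         for _ in range(5)]

projection = make_projection(in_dim=8, n_bits=16)
db_codes = [hash_code(fuse(img, txt), projection) for img, txt in items]

# Query with the features of item 2; its own code is at Hamming distance 0.
query_code = hash_code(fuse(*items[2]), projection)
result = retrieve(query_code, db_codes, top_k=3)
```

Because each item is stored as a 16-bit code, comparing a query against the database needs only an XOR and a bit count per item, which is the usual source of the speed advantage that hashing methods claim over comparing real-valued features directly.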

Taxonomy

Topics: Speech and dialogue systems