MUSE: Mamba is Efficient Multi-scale Learner for Text-video Retrieval

Haoran Tang; Meng Cao; Jinfa Huang; Ruyang Liu; Peng Jin; Ge Li; Xiaodan Liang

arXiv:2408.10575·cs.CV·February 24, 2025

TL;DR

MUSE introduces an efficient multi-scale learning framework for text-video retrieval, leveraging a feature pyramid and Mamba structure to enhance multi-resolution understanding with linear complexity.

Contribution

The paper proposes MUSE, a novel multi-scale learner with linear complexity that effectively models cross-resolution features for improved text-video retrieval.

Findings

01. Outperforms existing methods on three benchmarks.
02. Efficient multi-scale modeling with linear complexity.
03. Comprehensive analysis of model structures.

Abstract

Text-Video Retrieval (TVR) aims to align and associate relevant video content with corresponding natural language queries. Most existing TVR methods are based on large-scale pre-trained vision-language models (e.g., CLIP). However, due to the inherently plain structure of CLIP, few TVR methods explore multi-scale representations, which offer richer contextual information for a more thorough understanding. To this end, we propose MUSE, a multi-scale Mamba with linear computational complexity for efficient cross-resolution modeling. Specifically, the multi-scale representations are generated by applying a feature pyramid to the last single-scale feature map. Then, we employ the Mamba structure as an efficient multi-scale learner to jointly learn scale-wise representations. Furthermore, we conduct comprehensive studies to investigate different model structures and designs. Extensive…
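The two steps named in the abstract (a feature pyramid over the last single-scale feature map, then one sequence model over all scales) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the abstract does not say which pyramid operator is used, so average pooling, the scale factors (1, 2, 4), and the helper names `avg_pool`, `build_pyramid`, `flatten_scales`, and `linear_scan` are all assumptions made for illustration. In particular, `linear_scan` is a toy fixed-coefficient recurrence that only shows why a scan costs one pass over the sequence; it is not Mamba's selective state-space layer.

```python
from typing import List, Sequence

Vector = List[float]
Grid = List[List[Vector]]  # grid[row][col] is a D-dimensional patch feature


def avg_pool(grid: Grid, k: int) -> Grid:
    """Average-pool an H x W grid of feature vectors with a k x k window, stride k."""
    h, w, d = len(grid), len(grid[0]), len(grid[0][0])
    assert h % k == 0 and w % k == 0, "this sketch requires H and W divisible by k"
    pooled: Grid = []
    for r in range(0, h, k):
        row = []
        for c in range(0, w, k):
            cell = [grid[r + i][c + j] for i in range(k) for j in range(k)]
            row.append([sum(v[t] for v in cell) / len(cell) for t in range(d)])
        pooled.append(row)
    return pooled


def build_pyramid(grid: Grid, factors: Sequence[int] = (1, 2, 4)) -> List[Grid]:
    """Derive coarser scales from the single-scale map (assumed operator: avg pooling)."""
    return [grid if k == 1 else avg_pool(grid, k) for k in factors]


def flatten_scales(pyramid: List[Grid]) -> List[Vector]:
    """Concatenate all scales, row-major within each scale, into one token sequence."""
    return [vec for level in pyramid for row in level for vec in row]


def linear_scan(tokens: List[Vector], decay: float = 0.9) -> List[Vector]:
    """Toy recurrence h_t = decay * h_{t-1} + (1 - decay) * x_t.

    One pass over the sequence, so cost grows linearly with its length.
    NOT Mamba's selective scan; the coefficients here are fixed, not input-dependent.
    """
    state = [0.0] * len(tokens[0])
    outputs = []
    for x in tokens:
        state = [decay * s + (1.0 - decay) * xi for s, xi in zip(state, x)]
        outputs.append(state)
    return outputs


# Hypothetical 4 x 4 patch grid with 2-dimensional features [row, col].
grid = [[[float(r), float(c)] for c in range(4)] for r in range(4)]
pyramid = build_pyramid(grid)          # scales: 4x4, 2x2, 1x1
tokens = flatten_scales(pyramid)       # 16 + 4 + 1 = 21 tokens
mixed = linear_scan(tokens)            # one output per token
```

Concatenating the scales lengthens the sequence (21 tokens instead of 16 in this toy case), which is where a sequence model whose cost is linear in length becomes attractive compared with one whose cost grows quadratically.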

Code & Models

Repositories

hrtang22/MUSE · PyTorch · Official

Videos

MUSE: Mamba Is Efficient Multi-scale Learner for Text-video Retrieval · Underline

Taxonomy

Topics: Video Analysis and Summarization · Multimodal Machine Learning Applications

Methods: Mamba: Linear-Time Sequence Modeling with Selective State Spaces · ALIGN · Contrastive Language-Image Pre-training