MUSE: Mamba is Efficient Multi-scale Learner for Text-video Retrieval
Haoran Tang, Meng Cao, Jinfa Huang, Ruyang Liu, Peng Jin, Ge Li, Xiaodan Liang

TL;DR
MUSE introduces an efficient multi-scale learning framework for text-video retrieval, leveraging a feature pyramid and Mamba structure to enhance multi-resolution understanding with linear complexity.
Contribution
The paper proposes MUSE, a novel multi-scale learner with linear complexity that effectively models cross-resolution features for improved text-video retrieval.
Findings
Outperforms existing methods on three benchmarks.
Efficient multi-scale modeling with linear complexity.
Comprehensive analysis of model structures.
Abstract
Text-Video Retrieval (TVR) aims to align and associate relevant video content with corresponding natural language queries. Most existing TVR methods are based on large-scale pre-trained vision-language models (e.g., CLIP). However, due to the inherently plain structure of CLIP, few TVR methods explore multi-scale representations, which offer richer contextual information for a more thorough understanding. To this end, we propose MUSE, a multi-scale Mamba with linear computational complexity for efficient cross-resolution modeling. Specifically, the multi-scale representations are generated by applying a feature pyramid on the last single-scale feature map. Then, we employ the Mamba structure as an efficient multi-scale learner to jointly learn scale-wise representations. Furthermore, we conduct comprehensive studies to investigate different model structures and designs. Extensive…
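The two steps named in the abstract, building a feature pyramid from the last single-scale feature map and then learning over all scales jointly in linear time, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the pooling factors (1, 2, 4), the use of average pooling, the fine-to-coarse token order, and the function names. In particular, `linear_scan` is a fixed-decay recurrence standing in for Mamba; it only shows the single-pass, O(L) structure and does not implement Mamba's input-dependent (selective) state-space parameters.

```python
from typing import List, Sequence

# One frame's patch features as an H x W x C nested list (hypothetical layout).
Grid = List[List[List[float]]]


def avg_pool(grid: Grid, k: int) -> Grid:
    """Non-overlapping k x k average pooling over an H x W x C grid."""
    h, w, c = len(grid), len(grid[0]), len(grid[0][0])
    pooled = []
    for i in range(0, h - h % k, k):
        row = []
        for j in range(0, w - w % k, k):
            cell = [
                sum(grid[i + di][j + dj][ch] for di in range(k) for dj in range(k))
                / (k * k)
                for ch in range(c)
            ]
            row.append(cell)
        pooled.append(row)
    return pooled


def build_pyramid(grid: Grid, factors: Sequence[int] = (1, 2, 4)) -> List[Grid]:
    """Derive several resolutions from one single-scale feature map.

    The factors are illustrative assumptions, not the paper's settings.
    """
    return [avg_pool(grid, k) for k in factors]


def flatten_scales(pyramid: List[Grid]) -> List[List[float]]:
    """Concatenate every scale's tokens into one sequence (fine to coarse)."""
    return [cell for level in pyramid for row in level for cell in row]


def linear_scan(tokens: List[List[float]], decay: float = 0.9) -> List[List[float]]:
    """Stand-in for a state-space layer: h_t = decay * h_{t-1} + (1 - decay) * x_t.

    A single pass over the sequence, so cost grows linearly with its length.
    This is NOT Mamba's selective scan, whose parameters depend on the input.
    """
    state = [0.0] * len(tokens[0])
    outputs = []
    for x in tokens:
        state = [decay * s + (1.0 - decay) * v for s, v in zip(state, x)]
        outputs.append(state)
    return outputs
```

For a 4 x 4 grid, `build_pyramid` yields 16 + 4 + 1 = 21 tokens, and `linear_scan` visits each of them once. Self-attention over the same joint sequence would instead compare every pair of tokens, which is the quadratic cost that a linear-complexity learner avoids.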
Taxonomy
Topics: Video Analysis and Summarization · Multimodal Machine Learning Applications
Methods: Mamba: Linear-Time Sequence Modeling with Selective State Spaces · ALIGN · Contrastive Language-Image Pre-training
