M2HF: Multi-level Multi-modal Hybrid Fusion for Text-Video Retrieval
Shuo Liu, Weize Quan, Ming Zhou, Sihong Chen, Jian Kang, Zhe Zhao, Chen Chen, Dong-Ming Yan

TL;DR
This paper introduces M2HF, a multi-level multi-modal fusion network that improves text-video retrieval by exploiting fine-grained cross-modal interactions, achieving state-of-the-art results on multiple benchmarks.
Contribution
The paper proposes a novel multi-level hybrid fusion framework that integrates visual, audio, motion, and text features for improved text-video retrieval performance.
Findings
Achieves state-of-the-art Rank@1 on MSR-VTT (64.9%) and MSVD (68.2%).
Balances the contributions of the individual modalities with a new loss function.
Outperforms existing methods built on large-scale pre-trained models.
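For readers unfamiliar with the Rank@1 metric cited above, the following is a minimal sketch of how Rank@K is commonly computed for text-video retrieval. It is not taken from the paper: the function name `recall_at_k`, the toy similarity matrix, and the convention that query i is paired with video i are all assumptions made for illustration.

```python
def recall_at_k(similarity, k=1):
    """Fraction of text queries whose matching video is ranked in the top k.

    similarity[i][j] is the score of text query i against video j.
    Assumption for this sketch: the ground-truth video for query i is video i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Zero-based rank of the correct video: how many videos score strictly higher.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return hits / len(similarity)


# Hypothetical 3-query, 3-video similarity matrix (values are made up).
sim = [
    [0.9, 0.1, 0.3],  # query 0: correct video scores highest
    [0.2, 0.4, 0.8],  # query 1: correct video is ranked second
    [0.1, 0.2, 0.7],  # query 2: correct video scores highest
]

r1 = recall_at_k(sim, k=1)
r2 = recall_at_k(sim, k=2)
print(f"Rank@1 = {r1:.3f}, Rank@2 = {r2:.3f}")
```

In this toy case two of the three queries retrieve their video first, so Rank@1 is 2/3, while every correct video falls within the top two. The reported 64.9% and 68.2% figures would correspond to this quantity measured over the benchmark test sets.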
Abstract
Videos contain multi-modal content, and exploring multi-level cross-modal interactions with natural language queries can greatly benefit the text-video retrieval (TVR) task. However, recent methods that apply the large-scale pre-trained model CLIP to TVR do not focus on multi-modal cues in videos. Furthermore, traditional methods that simply concatenate multi-modal features do not exploit fine-grained cross-modal information in videos. In this paper, we propose a multi-level multi-modal hybrid fusion (M2HF) network to explore comprehensive interactions between text queries and the content of each modality in videos. Specifically, M2HF first early-fuses visual features extracted by CLIP with audio and motion features extracted from videos, obtaining audio-visual fusion features and motion-visual fusion features, respectively. The multi-modal alignment problem is also considered…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization
