M2HF: Multi-level Multi-modal Hybrid Fusion for Text-Video Retrieval

Shuo Liu; Weize Quan; Ming Zhou; Sihong Chen; Jian Kang; Zhe Zhao; Chen Chen; Dong-Ming Yan

arXiv:2208.07664 · cs.MM · August 23, 2022 · 1 citation

Open Access 1 Repo

TL;DR

This paper introduces M2HF, a multi-level multi-modal fusion network that enhances text-video retrieval by exploiting detailed cross-modal interactions and achieving state-of-the-art results on multiple benchmarks.

Contribution

The paper proposes a novel multi-level hybrid fusion framework that integrates visual, audio, motion, and text features for improved text-video retrieval performance.

Findings

01

Achieved state-of-the-art Rank@1 on MSR-VTT (64.9%) and MSVD (68.2%).

02

Effectively balances multi-modal contributions with a new loss function.

03

Outperformed existing methods built on large-scale pre-trained models such as CLIP.
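
For readers unfamiliar with the metric, Rank@K (also written R@K) is the percentage of text queries whose correct video appears among the top K retrieved results. The following is a minimal sketch of how it is commonly computed from a text-to-video similarity matrix. The function name `rank_at_k`, the toy matrix, and the assumption that query i matches video i are illustrative and are not taken from the paper or its repository.

```python
import numpy as np

def rank_at_k(sim, k=1):
    """Percentage of queries whose ground-truth video ranks in the top k.

    sim[i, j] is the similarity between text query i and video j.
    Assumption for this sketch: the correct video for query i is video i.
    """
    sim = np.asarray(sim, dtype=float)
    # Similarity score of the correct video for each query.
    gt = np.diag(sim)
    # Rank = 1 + number of videos scoring strictly higher than the correct one.
    ranks = 1 + (sim > gt[:, None]).sum(axis=1)
    return float((ranks <= k).mean() * 100.0)

# Toy 3x3 example: queries 0 and 2 retrieve their video first,
# query 1 retrieves its video second.
sim = np.array([
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.1, 0.2, 0.7],
])
r1 = rank_at_k(sim, k=1)
r2 = rank_at_k(sim, k=2)
print(f"R@1 = {r1:.1f}%, R@2 = {r2:.1f}%")
```

In this toy case two of the three queries rank their video first, so Rank@1 is about 66.7%, and all three rank it within the top two, so Rank@2 is 100%.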

Abstract

Videos contain multi-modal content, and exploring multi-level cross-modal interactions with natural language queries can greatly benefit the text-video retrieval (TVR) task. However, recent methods that apply the large-scale pre-trained model CLIP to TVR do not focus on multi-modal cues in videos. Furthermore, traditional methods that simply concatenate multi-modal features do not exploit fine-grained cross-modal information in videos. In this paper, we propose a multi-level multi-modal hybrid fusion (M2HF) network to explore comprehensive interactions between text queries and the content of each modality in videos. Specifically, M2HF first uses visual features extracted by CLIP for early fusion with audio and motion features extracted from videos, obtaining audio-visual fusion features and motion-visual fusion features, respectively. The multi-modal alignment problem is also considered…
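
The early-fusion step described in the abstract can be illustrated with a small sketch. The abstract does not say which operator performs the fusion, so the code below assumes single-head scaled dot-product cross-attention with a residual connection, in which visual frame features attend over audio or motion features. The function `cross_attention_fuse`, the feature dimension of 512, the sequence lengths, and the random inputs are all hypothetical and should not be read as the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_fuse(visual, other):
    """Fuse visual features with another modality (assumed operator).

    visual: (T_v, d) frame features, e.g. from a CLIP image encoder.
    other:  (T_o, d) audio or motion features, assumed to be already
            projected to the same dimension d.
    Returns (T_v, d) fused features.
    """
    d = visual.shape[-1]
    # Each visual frame attends over all steps of the other modality.
    attn = softmax(visual @ other.T / np.sqrt(d), axis=-1)  # (T_v, T_o)
    # Residual connection keeps the original visual information.
    return visual + attn @ other

# Random stand-ins for extracted features (illustrative shapes only).
rng = np.random.default_rng(0)
visual = rng.standard_normal((12, 512))
audio = rng.standard_normal((20, 512))
motion = rng.standard_normal((8, 512))

audio_visual = cross_attention_fuse(visual, audio)    # audio-visual fusion features
motion_visual = cross_attention_fuse(visual, motion)  # motion-visual fusion features
print(audio_visual.shape, motion_visual.shape)
```

Both fused outputs keep the visual sequence length, so they can be compared against text features in the same way as the unfused visual features.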

Code & Models

Repositories

cshwhale/M2HF (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization