Multi-modal Fusion and Query Refinement Network for Video Moment Retrieval and Highlight Detection

Yifang Xu; Yunzhuo Sun; Benxiang Zhai; Zien Xie; Youyao Jia; Sidan Du

arXiv:2501.10692·cs.CV·January 22, 2025

TL;DR

This paper introduces MRNet, a multi-modal fusion and query refinement network that leverages RGB, optical flow, and depth cues for improved video moment retrieval and highlight detection, outperforming existing methods.

Contribution

The paper presents a novel multi-modal fusion module and a query refinement module that enhance video understanding by combining multiple visual signals and refining textual queries.
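To make the idea of combining several visual signals concrete, here is a minimal sketch of a weighted fusion of RGB, optical flow, and depth features. It is not the paper's module, whose internals are not described on this page: the function names `softmax` and `fuse_modalities`, the `gate_scores` parameter, and the toy feature vectors are all assumptions for illustration. In a learned model the gate scores would presumably be predicted from the inputs rather than passed in by hand.

```python
import math


def softmax(scores):
    """Turn arbitrary real scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_modalities(rgb, flow, depth, gate_scores):
    """Blend three same-length feature vectors with softmax-normalized weights.

    gate_scores is a hypothetical list of three scores (RGB, flow, depth);
    it stands in for whatever a trained network would predict.
    """
    if not (len(rgb) == len(flow) == len(depth)):
        raise ValueError("all modalities must share one feature dimension")
    w_rgb, w_flow, w_depth = softmax(gate_scores)
    return [w_rgb * r + w_flow * f + w_depth * d
            for r, f, d in zip(rgb, flow, depth)]


# Toy features for a single clip (made-up values).
rgb = [1.0, 0.0]
flow = [0.0, 1.0]
depth = [1.0, 1.0]

# Equal gate scores give each modality a weight of 1/3.
fused = fuse_modalities(rgb, flow, depth, gate_scores=[0.0, 0.0, 0.0])
```

Raising one gate score relative to the others shifts the fused vector toward that modality, which is one simple way to read "dynamic" combination.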

Findings

01

MRNet outperforms state-of-the-art methods on QVHighlights and Charades datasets.

02

Improves on prior state-of-the-art methods by +3.41 in MR-mAP@Avg and +3.46 in HD-HIT@1 on QVHighlights.

03

Demonstrates the effectiveness of multi-modal cues and query refinement in video retrieval tasks.
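The metrics named in these findings are not defined on this page. As a rough orientation, moment retrieval metrics such as mAP are typically built on the temporal IoU between a predicted span and a ground-truth span, and HIT@1 is read here as whether the top-scored clip is a ground-truth highlight. The sketch below rests on those assumptions; the function names and the notion of a "positive clip" set are hypothetical, and the benchmark's exact protocol may differ.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) spans, in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def hit_at_1(saliency_pred, positive_clips):
    """Return 1.0 if the highest-scored clip index is a ground-truth positive.

    positive_clips is an assumed set of clip indices labeled as highlights.
    """
    best = max(range(len(saliency_pred)), key=saliency_pred.__getitem__)
    return 1.0 if best in positive_clips else 0.0


# Spans overlap for 5 s out of a 15 s union.
iou = temporal_iou((10.0, 20.0), (15.0, 25.0))

# Clip 2 has the highest predicted saliency and is labeled positive.
hit = hit_at_1([0.1, 0.4, 0.9, 0.2], positive_clips={2, 3})
```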

Abstract

Given a video and a linguistic query, video moment retrieval and highlight detection (MR&HD) aim to locate all the relevant spans while simultaneously predicting saliency scores. Most existing methods utilize RGB images as input, overlooking the inherent multi-modal visual signals like optical flow and depth. In this paper, we propose a Multi-modal Fusion and Query Refinement Network (MRNet) to learn complementary information from multi-modal cues. Specifically, we design a multi-modal fusion module to dynamically combine RGB, optical flow, and depth maps. Furthermore, to simulate human understanding of sentences, we introduce a query refinement module that merges text at three granularities: the word, phrase, and sentence levels. Comprehensive experiments on QVHighlights and Charades datasets indicate that MRNet outperforms current state-of-the-art methods, achieving…
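The query refinement idea, merging text at the word, phrase, and sentence levels, can be illustrated with a small sketch. This is not the authors' module: the abstract gives no implementation detail, so the sliding-window phrase pooling, the plain averaging of the three levels, and the names `mean_pool` and `refine_query` are all assumptions chosen for simplicity.

```python
def mean_pool(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def refine_query(word_feats, window=2):
    """Give every word a feature that also reflects its phrase and sentence.

    Phrase level (assumed): mean of a sliding window starting at the word,
    truncated at the end of the sentence.
    Sentence level (assumed): mean over all words.
    The three levels are merged by simple averaging.
    """
    if not word_feats:
        raise ValueError("query must contain at least one word feature")
    sentence = mean_pool(word_feats)
    phrases = [mean_pool(word_feats[i:i + window])
               for i in range(len(word_feats))]
    return [[(w[k] + p[k] + sentence[k]) / 3.0 for k in range(len(w))]
            for w, p in zip(word_feats, phrases)]


# Toy word features for a three-word query (made-up values).
word_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
refined = refine_query(word_feats)
```

The output keeps one vector per word, so it can replace the original word features wherever the query is matched against video features.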
