Multi-modal Fusion and Query Refinement Network for Video Moment Retrieval and Highlight Detection
Yifang Xu, Yunzhuo Sun, Benxiang Zhai, Zien Xie, Youyao Jia, Sidan Du

TL;DR
This paper introduces MRNet, a multi-modal fusion and query refinement network that leverages RGB, optical flow, and depth cues for improved video moment retrieval and highlight detection, outperforming existing methods.
Contribution
The paper presents a multi-modal fusion module that dynamically combines RGB, optical flow, and depth features, and a query refinement module that merges the text query at the word, phrase, and sentence levels.
Findings
MRNet outperforms state-of-the-art methods on QVHighlights and Charades datasets.
Improves MR-mAP@Avg by +3.41 and HD-HIT@1 by +3.46 on QVHighlights relative to prior state-of-the-art methods.
Demonstrates the effectiveness of multi-modal cues and query refinement in video retrieval tasks.
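For readers unfamiliar with the highlight-detection metric named above, here is a minimal sketch of how HIT@1 is commonly computed. It assumes the usual QVHighlights convention that a clip counts as a positive when its annotated saliency rating reaches a threshold. The function name, the argument names, and the default threshold of 4 are illustrative assumptions, not details taken from this paper.

```python
def hit_at_1(pred_scores, gt_scores, positive_threshold=4):
    """Fraction of videos whose highest-scored clip is a ground-truth positive.

    pred_scores: list of per-video lists of predicted saliency scores.
    gt_scores:   list of per-video lists of annotated saliency ratings.
    positive_threshold: assumed rating at or above which a clip is positive.
    """
    hits = 0
    for pred, gt in zip(pred_scores, gt_scores):
        # Index of the clip the model ranks highest for this video.
        top = max(range(len(pred)), key=pred.__getitem__)
        if gt[top] >= positive_threshold:
            hits += 1
    return hits / len(pred_scores)


# Two toy videos with three clips each (made-up numbers).
predictions = [[0.1, 0.9, 0.3], [0.8, 0.2, 0.1]]
annotations = [[1, 4, 2], [2, 4, 3]]
score = hit_at_1(predictions, annotations)
```

In this toy input the top-ranked clip is a positive for the first video only, so the score is one hit out of two videos.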
Abstract
Given a video and a linguistic query, video moment retrieval and highlight detection (MR&HD) aim to locate all the relevant spans while simultaneously predicting saliency scores. Most existing methods utilize RGB images as input, overlooking inherent multi-modal visual signals such as optical flow and depth. In this paper, we propose a Multi-modal Fusion and Query Refinement Network (MRNet) to learn complementary information from multi-modal cues. Specifically, we design a multi-modal fusion module to dynamically combine RGB, optical flow, and depth maps. Furthermore, to simulate human understanding of sentences, we introduce a query refinement module that merges text at different granularities: the word, phrase, and sentence levels. Comprehensive experiments on QVHighlights and Charades datasets indicate that MRNet outperforms current state-of-the-art methods, achieving…
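The abstract does not specify how the fusion module weights the three visual streams. As one plausible illustration of "dynamically combining" modalities, the sketch below scores each modality's clip feature with a gate vector and mixes the features with softmax weights. This is an assumed, generic gated fusion, not MRNet's actual module; `fuse_clip`, `gate_weights`, and the toy feature values are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_clip(rgb, flow, depth, gate_weights):
    """Fuse three per-clip feature vectors with input-dependent weights.

    gate_weights is a hypothetical list of three vectors (one per modality).
    Each modality's logit is the dot product of its gate vector and feature,
    so the mixing weights change with the input clip.
    """
    feats = [rgb, flow, depth]
    logits = [
        sum(w * x for w, x in zip(gate, feat))
        for gate, feat in zip(gate_weights, feats)
    ]
    alphas = softmax(logits)
    fused = [
        sum(alpha * feat[i] for alpha, feat in zip(alphas, feats))
        for i in range(len(rgb))
    ]
    return fused, alphas


# Toy 2-dimensional features; zero gates give equal weight to each modality.
fused, alphas = fuse_clip(
    rgb=[3.0, 0.0],
    flow=[0.0, 3.0],
    depth=[3.0, 3.0],
    gate_weights=[[0.0, 0.0]] * 3,
)
```

With all-zero gates the weights are uniform and the result is the plain average of the three features. In a trained model the gate vectors would be learned, so that, for example, motion-heavy clips could lean more on optical flow.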
