Boosting Video-Text Retrieval with Explicit High-Level Semantics
Haoran Wang, Di Xu, Dongliang He, Fu Li, Zhong Ji, Jungong Han, Errui Ding

TL;DR
This paper introduces HiSE, a novel model for video-text retrieval that incorporates explicit high-level semantic information from both modalities, significantly improving cross-modal alignment and retrieval performance.
Contribution
The work proposes a hierarchical high-level semantic modeling approach for VTR, decomposing semantics into discrete and holistic levels and integrating them via graph reasoning.
Findings
Achieves superior performance on MSR-VTT, MSVD, and DiDeMo datasets.
Effectively models high-level semantics to improve cross-modal alignment.
Outperforms state-of-the-art methods in video-text retrieval tasks.
Abstract
Video-text retrieval (VTR) is an attractive yet challenging task for multi-modal understanding, which aims to retrieve the relevant video given a text query, or the relevant text given a video query. Existing methods typically employ completely heterogeneous visual-textual information to align video and text, while lacking awareness of the homogeneous high-level semantic information residing in both modalities. To fill this gap, we propose a novel visual-linguistic aligning model named HiSE for VTR, which improves the cross-modal representation by incorporating explicit high-level semantics. First, we explore the hierarchical property of explicit high-level semantics and decompose it into two levels, i.e., discrete semantics and holistic semantics. Specifically, for the visual branch, we exploit an off-the-shelf semantic entity predictor to generate discrete high-level semantics. In parallel, a…
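The retrieval task described in the abstract (ranking videos for a text query, and texts for a video query) can be illustrated with a minimal sketch. This is not the HiSE model: it assumes each video and caption has already been mapped to a fixed-length embedding, and it uses plain cosine similarity for ranking. The helper names `cosine` and `rank` and the toy embeddings are hypothetical and chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank(query, candidates):
    """Return candidate indices ordered from most to least similar to the query."""
    scores = [cosine(query, c) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)


# Toy, hand-made embeddings (hypothetical; not produced by HiSE).
video_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
text_embeddings = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.9]]

# Text-to-video retrieval: rank all videos for the first caption.
t2v = rank(text_embeddings[0], video_embeddings)

# Video-to-text retrieval: rank all captions for the third video.
v2t = rank(video_embeddings[2], text_embeddings)

print(t2v, v2t)
```

In this toy setup, both directions share the same similarity function; only the roles of query and candidate set are swapped.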
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques
Methods: ALIGN
