Boosting Video-Text Retrieval with Explicit High-Level Semantics

Haoran Wang; Di Xu; Dongliang He; Fu Li; Zhong Ji; Jungong Han; Errui Ding

arXiv:2208.04215 · cs.CV · August 10, 2022

Open Access

TL;DR

This paper introduces HiSE, a novel model for video-text retrieval that incorporates explicit high-level semantic information from both modalities, significantly improving cross-modal alignment and retrieval performance.

Contribution

The work proposes a hierarchical high-level semantic modeling approach for VTR, decomposing semantics into discrete and holistic levels and integrating them via graph reasoning.

Findings

01. Achieves superior performance on MSR-VTT, MSVD, and DiDeMo datasets.

02. Effectively models high-level semantics to improve cross-modal alignment.

03. Outperforms state-of-the-art methods in video-text retrieval tasks.

Abstract

Video-text retrieval (VTR) is an attractive yet challenging task for multi-modal understanding, which aims to retrieve relevant videos given a text query, or relevant texts given a video query. Existing methods typically employ completely heterogeneous visual-textual information to align video and text, whilst lacking awareness of the homogeneous high-level semantic information residing in both modalities. To fill this gap, in this work, we propose a novel visual-linguistic aligning model named HiSE for VTR, which improves the cross-modal representation by incorporating explicit high-level semantics. First, we explore the hierarchical property of explicit high-level semantics, and further decompose it into two levels, i.e., discrete semantics and holistic semantics. Specifically, for the visual branch, we exploit an off-the-shelf semantic entity predictor to generate discrete high-level semantics. In parallel, a…
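
As a rough illustration of the retrieval task itself (not of HiSE), the following sketch ranks candidates in both directions by cosine similarity between embeddings and scores the result with Recall@K. The toy embeddings, the function names, and the choice of Recall@K as the metric are assumptions made for this example; nothing here is taken from the paper's implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank(query, candidates):
    """Return candidate indices ordered from most to least similar to the query."""
    scores = [cosine(query, c) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)

def recall_at_k(queries, candidates, k):
    """Fraction of queries whose paired candidate (same index) is in the top k."""
    hits = sum(1 for i, q in enumerate(queries) if i in rank(q, candidates)[:k])
    return hits / len(queries)

# Hypothetical toy embeddings: text i is the ground-truth match of video i.
text_emb = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
video_emb = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1], [0.6, 0.0, 0.5]]

t2v_r1 = recall_at_k(text_emb, video_emb, k=1)  # text query -> video
v2t_r1 = recall_at_k(video_emb, text_emb, k=1)  # video query -> text
print(f"text-to-video R@1: {t2v_r1:.2f}, video-to-text R@1: {v2t_r1:.2f}")
```

The two directions are scored separately because they need not agree: in this toy data the third video is closer to the first text than to its own, so video-to-text retrieval misses a match that text-to-video retrieval finds.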

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques

Methods: ALIGN