Towards Balanced Alignment: Modal-Enhanced Semantic Modeling for Video Moment Retrieval

Zhihang Liu; Jun Li; Hongtao Xie; Pandeng Li; Jiannan Ge; Sun-Ao Liu; Guoqing Jin

arXiv:2312.12155 · cs.CV · December 20, 2023 · 1 citation

TL;DR

This paper introduces MESM, a framework that enhances video and text features at two levels to achieve more balanced semantic alignment for improved video moment retrieval performance.

Contribution

The novel MESM framework enhances both video and textual modalities at two levels, addressing modality imbalance for better alignment in VMR tasks.

Findings

01

Achieves new state-of-the-art performance on three benchmarks.

02

Demonstrates strong generalization, especially in out-of-distribution settings.

03

Improves R1@0.7 by 4.42% and 7.69% on Charades-STA and Charades-CG, respectively.
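
For context on the retrieval metric reported above: VMR benchmarks typically score a predicted segment by its temporal IoU (intersection over union) with the ground-truth segment, and "R1@m" conventionally denotes the percentage of queries whose top-1 prediction reaches an IoU of at least m. The following is a minimal sketch of that convention, not the paper's evaluation code; the function names and the (start, end) tuple format are assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical representation: a moment is (start_seconds, end_seconds).
Segment = Tuple[float, float]


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection over union of two 1-D time intervals."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0


def recall_at_1(preds: List[Segment], gts: List[Segment],
                iou_threshold: float = 0.7) -> float:
    """Percentage of queries whose top-1 prediction has IoU >= threshold."""
    if len(preds) != len(gts):
        raise ValueError("need exactly one top-1 prediction per query")
    if not preds:
        return 0.0
    hits = sum(temporal_iou(p, g) >= iou_threshold for p, g in zip(preds, gts))
    return 100.0 * hits / len(preds)


if __name__ == "__main__":
    # Two queries: the first prediction matches exactly, the second
    # overlaps only 4 of 10 seconds.
    predictions = [(0.0, 10.0), (0.0, 4.0)]
    ground_truth = [(0.0, 10.0), (0.0, 10.0)]
    print(recall_at_1(predictions, ground_truth, iou_threshold=0.7))
```

A stricter threshold such as 0.7 rewards precise boundaries, so gains at that threshold indicate tighter localization rather than merely finding the right region of the video.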

Abstract

Video Moment Retrieval (VMR) aims to retrieve temporal segments in untrimmed videos corresponding to a given language query by constructing cross-modal alignment strategies. However, these existing strategies are often sub-optimal since they ignore the modality imbalance problem, i.e., the semantic richness inherent in videos far exceeds that of a given limited-length sentence. Therefore, in pursuit of better alignment, a natural idea is enhancing the video modality to filter out query-irrelevant semantics, and enhancing the text modality to capture more segment-relevant knowledge. In this paper, we introduce Modal-Enhanced Semantic Modeling (MESM), a novel framework for more balanced alignment through enhancing features at two levels. First, we enhance the video modality at the frame-word level through word reconstruction. This strategy emphasizes the portions associated with…

Code & Models

Repositories

lntzm/mesm
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization