A Hierarchical Multi-Modal Encoder for Moment Localization in Video Corpus

Bowen Zhang; Hexiang Hu; Joonseok Lee; Ming Zhao; Sheide Chammas; Vihan Jain; Eugene Ie; Fei Sha

arXiv:2011.09046·cs.CV·November 25, 2020·20 cites

Open Access

TL;DR

This paper introduces HAMMER, a hierarchical multi-modal encoder that represents a video at both the coarse-grained clip level and the fine-grained frame level in order to localize the moment that matches a text query within a video corpus.

Contribution

The paper presents a hierarchical encoder that encodes videos at two temporal scales, the clip level and the frame level, and reports better moment localization than previous methods on the ActivityNet Captions and TVR datasets.

Findings

01

Outperforms previous state-of-the-art on ActivityNet Captions and TVR datasets.

02

Effectively captures multi-scale temporal information for better localization.

03

Demonstrates robustness across different video datasets.

Abstract

Identifying a short segment in a long video that semantically matches a text query is a challenging task that has important application potentials in language-based video search, browsing, and navigation. Typical retrieval systems respond to a query with either a whole video or a pre-defined video segment, but it is challenging to localize undefined segments in untrimmed and unsegmented videos where exhaustively searching over all possible segments is intractable. The outstanding challenge is that the representation of a video must account for different levels of granularity in the temporal domain. To tackle this problem, we propose the HierArchical Multi-Modal EncodeR (HAMMER) that encodes a video at both the coarse-grained clip level and the fine-grained frame level to extract information at different scales based on multiple subtasks, namely, video retrieval, segment temporal…
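To make the abstract's remark about exhaustive search concrete, the sketch below counts the contiguous segments in a single untrimmed video. It is an illustration only, not part of HAMMER: the function names, the one-hour video sampled at one frame per second, and the 32-frame clip length are all assumptions chosen for the arithmetic. A sequence of n time steps has n(n+1)/2 contiguous segments, so the candidate count grows quadratically with video length.

```python
import math


def count_segments(num_steps: int) -> int:
    """Number of contiguous segments (start <= end) over num_steps time steps."""
    return num_steps * (num_steps + 1) // 2


def enumerate_segments(num_steps: int) -> list[tuple[int, int]]:
    """All (start, end) index pairs with start <= end; only practical for small inputs."""
    return [(s, e) for s in range(num_steps) for e in range(s, num_steps)]


# Hypothetical numbers: a one-hour video sampled at 1 frame per second.
num_frames = 3600
# Hypothetical clip length used to group frames into coarser units.
frames_per_clip = 32
num_clips = math.ceil(num_frames / frames_per_clip)

frame_level_candidates = count_segments(num_frames)  # 6481800
clip_level_candidates = count_segments(num_clips)    # 6441

print(frame_level_candidates, clip_level_candidates)
```

Under these assumed numbers, a single video already yields millions of frame-level candidates, and the count is multiplied again by the number of videos in the corpus. Working at a coarser clip granularity shrinks the candidate set by roughly the square of the clip length, at the cost of boundary precision, which is one way to read the abstract's point that a representation must account for more than one temporal granularity.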

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning