A Hierarchical Multi-Modal Encoder for Moment Localization in Video Corpus
Bowen Zhang, Hexiang Hu, Joonseok Lee, Ming Zhao, Sheide Chammas, Vihan Jain, Eugene Ie, Fei Sha

TL;DR
This paper introduces HAMMER, a hierarchical multi-modal encoder that represents videos at multiple temporal granularities to improve the localization of moments that match a text query in a video corpus.
Contribution
The paper presents a hierarchical encoder that encodes videos at both the coarse-grained clip level and the fine-grained frame level, improving moment localization accuracy over existing methods.
Findings
Outperforms previous state-of-the-art on ActivityNet Captions and TVR datasets.
Effectively captures multi-scale temporal information for better localization.
Demonstrates robustness across different video datasets.
Abstract
Identifying a short segment in a long video that semantically matches a text query is a challenging task that has important application potentials in language-based video search, browsing, and navigation. Typical retrieval systems respond to a query with either a whole video or a pre-defined video segment, but it is challenging to localize undefined segments in untrimmed and unsegmented videos where exhaustively searching over all possible segments is intractable. The outstanding challenge is that the representation of a video must account for different levels of granularity in the temporal domain. To tackle this problem, we propose the HierArchical Multi-Modal EncodeR (HAMMER) that encodes a video at both the coarse-grained clip level and the fine-grained frame level to extract information at different scales based on multiple subtasks, namely, video retrieval, segment temporal…
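The two ideas in the abstract, the cost of exhaustive segment search and the clip-level versus frame-level views of a video, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the fixed clip length, and the use of mean pooling in place of learned encoders are all assumptions made for illustration.

```python
from statistics import fmean


def split_into_clips(frames, clip_len):
    """Group per-frame feature vectors into consecutive clips of clip_len frames."""
    return [frames[i:i + clip_len] for i in range(0, len(frames), clip_len)]


def mean_pool(vectors):
    """Average a non-empty list of equal-length vectors element-wise."""
    return [fmean(dim) for dim in zip(*vectors)]


def encode_two_levels(frames, clip_len):
    """Return (frame_level, clip_level) views of one video.

    frame_level: frame features grouped by clip (fine-grained view).
    clip_level: one pooled vector per clip (coarse-grained view).
    Mean pooling is a hypothetical stand-in for a learned encoder.
    """
    frame_level = split_into_clips(frames, clip_len)
    clip_level = [mean_pool(clip) for clip in frame_level]
    return frame_level, clip_level


def count_candidate_segments(num_frames):
    """Count contiguous (start, end) segments over num_frames frames."""
    return num_frames * (num_frames + 1) // 2


# Toy video: 6 frames, each with a 2-dimensional feature vector.
video = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0],
         [6.0, 7.0], [8.0, 9.0], [10.0, 11.0]]
frame_level, clip_level = encode_two_levels(video, clip_len=3)

print(clip_level)                      # [[2.0, 3.0], [8.0, 9.0]]
print(count_candidate_segments(6))     # 21
print(count_candidate_segments(1000))  # 500500
```

The segment count grows quadratically with the number of frames, which is why scoring every candidate segment in every video of a corpus quickly becomes impractical. The coarse clip-level view gives a much shorter sequence to reason over, while the frame-level view keeps the detail needed to place segment boundaries.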
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning
