Text-based Localization of Moments in a Video Corpus
Sudipta Paul, Niluthpol Chowdhury Mithun, and Amit K. Roy-Chowdhury

TL;DR
This paper introduces HMAN, a novel model for localizing specific moments in a large video corpus based on textual queries, addressing both retrieval and temporal localization challenges.
Contribution
The paper proposes the Hierarchical Moment Alignment Network (HMAN), which jointly learns to embed moments and sentences, so that moments can be localized across multiple videos without assuming that the relevant video is known in advance.
Findings
HMAN outperforms existing methods on the Charades-STA, DiDeMo, and ActivityNet Captions datasets.
The model effectively distinguishes between moments within the same video and between semantic concepts across different videos.
Results show promising accuracy in retrieving and localizing moments in large video collections.
Abstract
Prior work on text-based video moment localization focuses on temporally grounding a textual query in an untrimmed video. These works assume that the relevant video is already known and attempt to localize the moment in that video only. In contrast, we relax this assumption and address the task of localizing moments in a corpus of videos for a given sentence query. This task poses a unique challenge because the system must perform (i) retrieval of the relevant video, where only a segment of the video corresponds to the queried sentence, and (ii) temporal localization of the moment in the relevant video based on the sentence query. To overcome this challenge, we propose the Hierarchical Moment Alignment Network (HMAN), which learns an effective joint embedding space for moments and sentences. In addition to learning subtle differences between intra-video moments,…
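To make the task concrete, the sketch below shows how a joint embedding space can serve both steps at once: every candidate moment in every video is scored against the sentence embedding, and the top-scoring entry yields both the video (retrieval) and the time span (localization). This is only an illustration of the general idea, not the authors' HMAN implementation. The function names, the cosine-similarity scoring, and the hand-written two-dimensional vectors are all assumptions standing in for learned embeddings.

```python
import math
from typing import Dict, List, Sequence, Tuple

# Hypothetical layout: video_id -> list of (start_sec, end_sec, moment_embedding).
Corpus = Dict[str, List[Tuple[float, float, Sequence[float]]]]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def localize_in_corpus(
    query_vec: Sequence[float], corpus: Corpus, top_k: int = 1
) -> List[Tuple[float, Tuple[str, float, float]]]:
    """Rank every candidate moment in every video against the query embedding.

    Returns the top_k entries as (score, (video_id, start_sec, end_sec)),
    so one ranking answers both "which video" and "which time span".
    """
    scored = []
    for video_id, moments in corpus.items():
        for start, end, moment_vec in moments:
            scored.append((cosine(query_vec, moment_vec), (video_id, start, end)))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_k]


# Toy corpus with made-up embeddings (assumption: a trained model would supply these).
toy_corpus: Corpus = {
    "video_a": [(0.0, 5.0, [1.0, 0.0]), (5.0, 10.0, [0.0, 1.0])],
    "video_b": [(0.0, 8.0, [0.7, 0.7]), (8.0, 16.0, [-1.0, 0.0])],
}
toy_query = [0.0, 2.0]  # stand-in for a sentence embedding

for score, (video_id, start, end) in localize_in_corpus(toy_query, toy_corpus, top_k=2):
    print(f"{video_id} [{start:.1f}s, {end:.1f}s] score={score:.3f}")
```

Scoring every moment exhaustively is the simplest possible design and scales linearly with corpus size; the hierarchical alignment described in the abstract addresses how such an embedding space is learned, which this sketch does not attempt to reproduce.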
