TL;DR
This paper introduces an attentive cross-modal relevance matching (ACRM) model for video moment retrieval. It improves localization accuracy by modeling cross-modal interactions and by emphasizing the important query words.
Contribution
The paper proposes a novel ACRM model that enhances cross-modal interaction modeling and incorporates an internal frame predictor for better localization accuracy.
Findings
Outperforms state-of-the-art methods on the TACoS and Charades-STA datasets.
Attention module effectively emphasizes semantically rich query words.
Additional internal frame predictor improves localization precision.
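To make the idea of emphasizing query words concrete, here is a minimal sketch of generic softmax attention pooling over word vectors. It is not the paper's attention module: the function names, the toy word vectors, and the relevance scores are all hypothetical, and in a trained model the scores would be learned rather than supplied by hand.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(word_vectors, scores):
    """Pool word vectors into one query vector, weighted by attention.

    word_vectors: one equal-length list of floats per query word (hypothetical input)
    scores: one raw relevance score per word (hypothetical; learned in practice)
    """
    weights = softmax(scores)
    dim = len(word_vectors[0])
    pooled = [
        sum(w * vec[i] for w, vec in zip(weights, word_vectors))
        for i in range(dim)
    ]
    return weights, pooled


# Toy query with three words; the first word gets the highest raw score.
weights, pooled = attend(
    [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]],
    [2.0, 0.0, 0.0],
)
```

A word with a higher score receives a larger weight, so it contributes more to the pooled query representation than low-scoring function words.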
Abstract
Video moment retrieval aims to retrieve the moment in a video that matches a given language query. The task poses two challenges: 1) localizing the relevant moment in an untrimmed video, and 2) bridging the semantic gap between the textual query and the video content. To tackle these problems, early approaches use a sliding window or uniform sampling to collect video clips first and then match each clip against the query. These strategies are time-consuming and often yield unsatisfactory localization accuracy because the length of the ground-truth moment is unpredictable. To avoid these limitations, researchers have recently attempted to predict the relevant moment boundaries directly, without first generating video clips. One mainstream approach is to generate a multimodal feature vector for the target query and video frames (e.g., by concatenation) and then use a…
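The sliding-window strategy mentioned in the abstract can be illustrated with a short sketch. This is not code from the paper: the function names, window sizes, and stride ratio are assumptions chosen for illustration. It enumerates fixed-length candidate clips and measures how well the best one overlaps a target moment using temporal intersection over union (IoU).

```python
def sliding_window_candidates(num_frames, window_sizes, stride_ratio=0.5):
    """Enumerate candidate (start, end) frame spans; end is exclusive.

    window_sizes and stride_ratio are hypothetical settings, not values
    taken from the paper.
    """
    candidates = []
    for size in window_sizes:
        if size > num_frames:
            continue  # window longer than the video: no valid span
        stride = max(1, int(size * stride_ratio))
        for start in range(0, num_frames - size + 1, stride):
            candidates.append((start, start + size))
    return candidates


def temporal_iou(a, b):
    """Intersection over union of two (start, end) spans."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def best_candidate(candidates, target):
    """Return the candidate span with the highest IoU against the target."""
    return max(candidates, key=lambda c: temporal_iou(c, target))


# A 10-frame video, one window size of 4 frames, target moment of 6 frames.
clips = sliding_window_candidates(10, [4])
best = best_candidate(clips, (3, 9))
```

In this toy setting no 4-frame window can cover a 6-frame target exactly, so the best achievable IoU stays below 1. Covering more moment lengths means adding more window sizes, and every extra candidate has to be matched against the query, which is the cost the abstract describes.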
