Learning to Combine the Modalities of Language and Video for Temporal Moment Localization
Jungkyoo Shin, Jinyoung Moon

TL;DR
This paper introduces a novel recurrent unit, CM-LSTM, and a two-stream attention mechanism to improve temporal moment localization in videos by better modeling long-term dependencies and boundary precision.
Contribution
The paper proposes CM-LSTM and TACI, new methods that enhance the integration of visual and semantic information for more accurate video moment localization.
Findings
TACI outperforms state-of-the-art methods on the ActivityNet-Captions dataset.
Replacing LSTM with CM-LSTM improves existing models' performance.
The proposed methods achieve higher R@1 scores at IoU thresholds of 0.5 and 0.7.
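The R@1 metric at an IoU threshold, as it is commonly defined for temporal moment localization, can be sketched as follows. This is a minimal illustration and not the authors' evaluation code: the function names `temporal_iou` and `recall_at_1` and the sample segments are assumptions made up for the example.

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) segments given in seconds."""
    pred_start, pred_end = pred
    gt_start, gt_end = gt
    # Overlap is zero when the segments do not intersect.
    intersection = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - intersection
    return intersection / union if union > 0 else 0.0


def recall_at_1(top1_preds, gts, threshold):
    """Fraction of queries whose top-1 prediction has IoU >= threshold."""
    hits = sum(
        temporal_iou(pred, gt) >= threshold
        for pred, gt in zip(top1_preds, gts)
    )
    return hits / len(gts)


# Hypothetical top-1 predictions and ground-truth moments for three queries.
preds = [(2.0, 10.0), (0.0, 5.0), (14.0, 20.0)]
gts = [(3.0, 10.0), (4.0, 9.0), (10.0, 20.0)]

for threshold in (0.5, 0.7):
    print(f"R@1, IoU={threshold}: {recall_at_1(preds, gts, threshold):.3f}")
```

A stricter threshold such as 0.7 only counts predictions whose boundaries closely match the annotated moment, which is why it is usually reported alongside 0.5 as a measure of boundary precision.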
Abstract
Temporal moment localization aims to retrieve the best video segment matching a moment specified by a query. The existing methods generate the visual and semantic embeddings independently and fuse them without full consideration of the long-term temporal relationship between them. To address these shortcomings, we introduce a novel recurrent unit, cross-modal long short-term memory (CM-LSTM), by mimicking the human cognitive process of localizing temporal moments that focuses on the part of a video segment related to the part of a query, and accumulates the contextual information across the entire video recurrently. In addition, we devise a two-stream attention mechanism for both attended and unattended video features by the input query to prevent necessary visual information from being neglected. To obtain more precise boundaries, we propose a two-stream attentive cross-modal…
Taxonomy
Topics: Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications
Methods: Tanh Activation · Sigmoid Activation · Long Short-Term Memory
