Learning to Combine the Modalities of Language and Video for Temporal Moment Localization

Jungkyoo Shin; Jinyoung Moon

arXiv:2109.02925·cs.CV·September 8, 2021

Open Access

TL;DR

This paper introduces a novel recurrent unit, CM-LSTM, and a two-stream attention mechanism to improve temporal moment localization in videos by better modeling long-term dependencies and boundary precision.

Contribution

The paper proposes CM-LSTM and TACI, new methods that enhance the integration of visual and semantic information for more accurate video moment localization.

Findings

01. TACI outperforms state-of-the-art methods on the ActivityNet-Captions dataset.

02. Replacing LSTM with CM-LSTM improves the performance of existing models.

03. The proposed methods achieve higher R@1 scores at IoU thresholds of 0.5 and 0.7.
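
For readers unfamiliar with the metric in the last finding, the sketch below shows how R@1 at an IoU threshold is commonly computed for temporal moment localization: a query counts as a hit when the top-ranked predicted segment overlaps the ground-truth segment with temporal IoU at or above the threshold. The function names and the toy segments are hypothetical and chosen for illustration; they are not taken from the paper or its evaluation code.

```python
def temporal_iou(pred, gt):
    """Temporal IoU of two (start, end) segments, e.g. in seconds."""
    ps, pe = pred
    gs, ge = gt
    # Length of the overlap; zero when the segments are disjoint.
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(top1_preds, gts, threshold):
    """Fraction of queries whose top-1 prediction reaches the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(top1_preds, gts))
    return hits / len(gts)


# Toy data (made up): one top-1 prediction and one ground truth per query.
preds = [(2.0, 7.0), (0.0, 5.0), (10.0, 20.0)]
gts = [(2.0, 10.0), (4.0, 9.0), (10.0, 19.0)]

r1_at_05 = recall_at_1(preds, gts, 0.5)  # IoUs are 0.625, 1/9, 0.9 -> 2 of 3 hit
r1_at_07 = recall_at_1(preds, gts, 0.7)  # only the 0.9 prediction hits -> 1 of 3
```

The stricter 0.7 threshold rewards precise boundaries, which is why results at both thresholds are usually reported together.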

Abstract

Temporal moment localization aims to retrieve the best video segment matching a moment specified by a query. The existing methods generate the visual and semantic embeddings independently and fuse them without full consideration of the long-term temporal relationship between them. To address these shortcomings, we introduce a novel recurrent unit, cross-modal long short-term memory (CM-LSTM), by mimicking the human cognitive process of localizing temporal moments that focuses on the part of a video segment related to the part of a query, and accumulates the contextual information across the entire video recurrently. In addition, we devise a two-stream attention mechanism for both attended and unattended video features by the input query to prevent necessary visual information from being neglected. To obtain more precise boundaries, we propose a two-stream attentive cross-modal…

Taxonomy

Topics: Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications

Methods: Tanh Activation · Sigmoid Activation · Long Short-Term Memory
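
As a reference point for the method tags above, here is a minimal sketch of one step of a standard (textbook) LSTM cell, showing where the sigmoid and tanh activations enter. This is not the paper's CM-LSTM, whose cross-modal gating is not specified on this page; the function name `lstm_step`, the gate keys, and the weight layout are assumptions made for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, W, b):
    """One step of a standard LSTM cell (not CM-LSTM).

    x: input vector (list of floats, length n_in)
    h_prev, c_prev: previous hidden and cell states (lists, length n_hid)
    W: dict mapping gate name ("i", "f", "o", "g") to an
       n_hid x (n_in + n_hid) weight matrix (list of rows)
    b: dict mapping gate name to a bias vector of length n_hid
    """
    z = x + h_prev  # concatenate input and previous hidden state

    def affine(gate):
        return [sum(w * v for w, v in zip(row, z)) + bias
                for row, bias in zip(W[gate], b[gate])]

    i = [sigmoid(a) for a in affine("i")]    # input gate
    f = [sigmoid(a) for a in affine("f")]    # forget gate
    o = [sigmoid(a) for a in affine("o")]    # output gate
    g = [math.tanh(a) for a in affine("g")]  # candidate cell update

    # The new cell state mixes the retained memory with the gated candidate.
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c


# Toy usage with all-zero weights: every gate is 0.5 and the candidate is 0,
# so the cell state is simply halved.
W = {k: [[0.0, 0.0]] for k in "ifog"}
b = {k: [0.0] for k in "ifog"}
h, c = lstm_step([0.3], [0.0], [1.0], W, b)
```

Applying `lstm_step` once per time step, feeding each returned `h` and `c` into the next call, is how a recurrent unit of this kind accumulates context across a sequence.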