Video Moment Retrieval via Natural Language Queries

Xinli Yu; Mohsen Malmir; Cynthia He; Yue Liu; Rex Wu

arXiv:2009.02406·cs.CV·September 11, 2020

Open Access

TL;DR

This paper introduces a novel, efficient video moment retrieval method using multi-head self-attention and cross-attention mechanisms, combined with multi-task training, achieving state-of-the-art results on key metrics.

Contribution

The paper presents a new attention-based architecture with multi-task training for improved video moment retrieval performance and efficiency.

Findings

01

Achieved state-of-the-art (SOTA) performance on R@1 metrics and surpassed the prior SOTA on the high-IoU metric (R@1, IoU=0.7).

02

Demonstrated the effectiveness of multi-head and cross-attention mechanisms.

03

Showed that joint training with moment segmentation improves accuracy.
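The R@1, IoU=0.7 metric named in the findings can be illustrated with a minimal sketch. It assumes each moment is a (start, end) pair in seconds and that the model's top-ranked prediction is compared against one ground-truth moment per query. The function names `temporal_iou` and `recall_at_1` are hypothetical and are not taken from the paper.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) intervals."""
    (ps, pe), (gs, ge) = pred, gt
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(top1_preds, gts, iou_threshold=0.7):
    """Fraction of queries whose top-1 moment reaches the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= iou_threshold
               for p, g in zip(top1_preds, gts))
    return hits / len(gts)


# Illustrative values only, not results from the paper.
preds = [(2.0, 10.0), (0.0, 5.0)]
gts = [(3.0, 10.0), (4.0, 9.0)]
score = recall_at_1(preds, gts, iou_threshold=0.7)
```

A higher threshold such as 0.7 only counts predictions whose boundaries closely match the annotated moment, which is why it is treated as the stricter setting.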

Abstract

In this paper, we propose a novel method for video moment retrieval (VMR) that achieves state-of-the-art (SOTA) performance on R@1 metrics and surpasses the SOTA on the high-IoU metric (R@1, IoU=0.7). First, we propose to use a multi-head self-attention mechanism, and further a cross-attention scheme, to capture video/query interaction and long-range query dependencies from video context. The attention-based methods can develop frame-to-query and query-to-frame interaction at arbitrary positions, and the multi-head setting ensures a sufficient understanding of complicated dependencies. Our model has a simple architecture, which enables faster training and inference while maintaining . Second, we propose a multi-task training objective consisting of a moment segmentation task, start/end distribution prediction, and a start/end location regression task. We have…
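The frame-to-query and query-to-frame interaction described in the abstract can be sketched with generic scaled dot-product cross-attention. This is a single-head illustration under simplifying assumptions: it has no learned projections and no multi-head split, the feature values are made up, and it is not the authors' architecture. The names `softmax` and `cross_attention` are hypothetical helpers.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Each query vector attends over every key/value position."""
    d = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Scaled dot-product score between this query and every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of value vectors, one output channel at a time.
        outputs.append([sum(wj * v[c] for wj, v in zip(w, values))
                        for c in range(len(values[0]))])
    return outputs, weights


# Toy features (assumed): 3 video frames and 2 query words, dimension 2.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
words = [[1.0, 0.0], [0.0, 2.0]]

frame_ctx, f2q = cross_attention(frames, words, words)   # frame-to-query
word_ctx, q2f = cross_attention(words, frames, frames)   # query-to-frame
```

Because every frame scores against every word (and vice versa), the interaction is not limited to nearby positions, which matches the "arbitrary positions" property the abstract attributes to attention-based methods.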

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Human Pose and Action Recognition