Video Moment Retrieval via Natural Language Queries
Xinli Yu, Mohsen Malmir, Cynthia He, Yue Liu, Rex Wu

TL;DR
This paper introduces a novel, efficient video moment retrieval method using multi-head self-attention and cross-attention mechanisms, combined with multi-task training, achieving state-of-the-art results on key metrics.
Contribution
The paper presents a new attention-based architecture with multi-task training for improved video moment retrieval performance and efficiency.
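As a rough illustration of what a multi-task objective over the three tasks named in the abstract (moment segmentation, start/end distribution prediction, start/end location regression) could look like, here is a minimal NumPy sketch. The specific loss forms (binary cross-entropy, cross-entropy over frames, normalized L1), the equal weights, and all function and parameter names are assumptions for illustration; the summary does not specify the authors' formulation.

```python
import numpy as np

def multi_task_loss(seg_logits, start_logits, end_logits, pred_span, gt_span,
                    weights=(1.0, 1.0, 1.0)):
    """Hypothetical combined loss for one video with T frames.

    seg_logits, start_logits, end_logits: arrays of shape (T,)
    pred_span: regressed (start, end) frame positions
    gt_span: ground-truth (start, end) frame indices, inclusive
    """
    T = seg_logits.shape[0]
    s, e = gt_span

    # Task 1 (assumed form): per-frame binary cross-entropy, 1 inside the moment.
    inside = np.zeros(T)
    inside[s:e + 1] = 1.0
    seg = np.mean(np.maximum(seg_logits, 0) - seg_logits * inside
                  + np.log1p(np.exp(-np.abs(seg_logits))))

    # Task 2 (assumed form): cross-entropy of start/end distributions over frames.
    def cross_entropy(logits, idx):
        z = logits - logits.max()
        return -(z[idx] - np.log(np.exp(z).sum()))
    dist = cross_entropy(start_logits, s) + cross_entropy(end_logits, e)

    # Task 3 (assumed form): L1 regression error, normalized by video length.
    reg = np.abs(np.asarray(pred_span, float) - np.array([s, e], float)).mean() / T

    w_seg, w_dist, w_reg = weights
    return w_seg * seg + w_dist * dist + w_reg * reg

# Uninformative (all-zero) logits with a perfect regressed span.
zeros = np.zeros(10)
loss = multi_task_loss(zeros, zeros, zeros, pred_span=(2, 5), gt_span=(2, 5))
```

In practice the relative weights would be tuned; the point of the sketch is only that the three signals are summed into one training objective.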
Findings
Achieved SOTA on R@1 metrics and surpassed SOTA on R@1, IoU=0.7.
Demonstrated the effectiveness of multi-head and cross-attention mechanisms.
Showed that joint training with moment segmentation improves accuracy.
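For readers unfamiliar with the metric quoted in these findings, "R@1, IoU=0.7" is commonly computed as the fraction of queries whose top-ranked predicted moment overlaps the ground-truth moment with a temporal intersection-over-union of at least 0.7. The sketch below follows that common definition; the function names are hypothetical, and the paper's exact evaluation protocol is not given in this summary.

```python
def temporal_iou(pred, gt):
    """IoU of two (start, end) time intervals, e.g. in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_1(top1_preds, gts, iou_threshold=0.7):
    """Fraction of queries whose top-1 prediction reaches the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= iou_threshold
               for p, g in zip(top1_preds, gts))
    return hits / len(gts)

# Two queries: the first prediction is exact, the second overlaps only partially.
preds = [(0.0, 10.0), (5.0, 15.0)]
truth = [(0.0, 10.0), (0.0, 10.0)]
score = recall_at_1(preds, truth, iou_threshold=0.7)
```

A higher IoU threshold demands tighter temporal localization, which is why IoU=0.7 is treated as the harder setting.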
Abstract
In this paper, we propose a novel method for video moment retrieval (VMR) that achieves state-of-the-art (SOTA) performance on R@1 metrics and surpasses the SOTA on the high-IoU metric (R@1, IoU=0.7). First, we propose a multi-head self-attention mechanism, followed by a cross-attention scheme, to capture video/query interaction and long-range query dependencies from the video context. The attention-based methods can model frame-to-query and query-to-frame interactions at arbitrary positions, and the multi-head setting ensures a sufficient understanding of complicated dependencies. Our model has a simple architecture, which enables faster training and inference while maintaining . Second, we propose a multi-task training objective consisting of a moment segmentation task, start/end distribution prediction, and a start/end location regression task. We have…
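The frame-to-query and query-to-frame interactions described in the abstract can be illustrated with standard multi-head scaled dot-product cross-attention. The following is a minimal NumPy sketch under stated assumptions: the feature dimension, head count, sequence lengths, and randomly initialized projection matrices are placeholders, and the function name is hypothetical. It shows the generic mechanism, not the authors' architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_cross_attention(queries, context, num_heads, rng):
    """Attend from `queries` (Lq, d) to `context` (Lc, d).

    Returns the attended features (Lq, d) and the attention
    weights (num_heads, Lq, Lc). Projections are random placeholders.
    """
    d = queries.shape[-1]
    assert d % num_heads == 0, "feature size must divide evenly across heads"
    dh = d // num_heads
    Wq, Wk, Wv, Wo = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(4))

    def split(x):  # (L, d) -> (heads, L, dh)
        return x.reshape(x.shape[0], num_heads, dh).transpose(1, 0, 2)

    q, k, v = split(queries @ Wq), split(context @ Wk), split(context @ Wv)
    weights = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(dh))  # (heads, Lq, Lc)
    heads = weights @ v                                        # (heads, Lq, dh)
    merged = heads.transpose(1, 0, 2).reshape(-1, d)           # (Lq, d)
    return merged @ Wo, weights

rng = np.random.default_rng(0)
frames = rng.standard_normal((20, 64))  # 20 video frames (assumed sizes)
words = rng.standard_normal((6, 64))    # 6 query tokens

# Frame-to-query: each frame gathers information from the query words.
f2q, f2q_weights = multi_head_cross_attention(frames, words, num_heads=4, rng=rng)
# Query-to-frame: each word gathers information from the video frames.
q2f, q2f_weights = multi_head_cross_attention(words, frames, num_heads=4, rng=rng)
```

Because every frame can attend to every word (and vice versa) in a single step, the interaction is not limited by temporal distance, which is the property the abstract attributes to attention at arbitrary positions.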
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Human Pose and Action Recognition
