Aligning Moments in Time using Video Queries
Yogesh Kumar, Uday Agarwal, Manish Gupta, Anand Mishra

TL;DR
This paper introduces MATR, a transformer-based model for video-to-video moment retrieval that aligns semantic and temporal features to accurately localize unseen events in videos, significantly outperforming previous methods.
Contribution
The paper presents MATR, a novel transformer model that combines dual-stage sequence alignment with a self-supervised pre-training technique to improve moment localization in videos.
Findings
Achieves a 13.1% improvement in R@1 on ActivityNet-VRL.
Shows a 14.7% gain in R@1 on the SportsMoments dataset.
Outperforms state-of-the-art methods in video-to-video moment retrieval.
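For readers unfamiliar with the metric, R@1 in moment retrieval is commonly computed as the fraction of queries whose top-ranked predicted segment overlaps the ground-truth moment by at least some temporal IoU threshold. The sketch below illustrates that common definition; the 0.5 threshold, the (start, end) segment format, and the function names are assumptions for illustration and are not taken from the paper.

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) segments, e.g. in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(top1_preds, gts, iou_threshold=0.5):
    """Fraction of queries whose top-1 prediction meets the IoU threshold.

    iou_threshold=0.5 is an assumed value; the summary does not state
    which threshold the reported R@1 numbers use.
    """
    hits = sum(
        temporal_iou(p, g) >= iou_threshold for p, g in zip(top1_preds, gts)
    )
    return hits / len(gts)


# Hypothetical predictions and ground-truth moments for two queries.
preds = [(0.0, 10.0), (2.0, 8.0)]
truth = [(0.0, 10.0), (20.0, 30.0)]
score = recall_at_1(preds, truth)  # first query hits, second misses
```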
Abstract
Video-to-video moment retrieval (Vid2VidMR) is the task of localizing unseen events or moments in a target video using a query video. This task poses several challenges, such as the need for semantic frame-level alignment and modeling complex dependencies between query and target videos. To tackle this challenging problem, we introduce MATR (Moment Alignment TRansformer), a transformer-based model designed to capture semantic context as well as the temporal details necessary for precise moment localization. MATR conditions target video representations on query video features using dual-stage sequence alignment that encodes the required correlations and dependencies. These representations are then used to guide foreground/background classification and boundary prediction heads, enabling the model to accurately identify moments in the target video that semantically match with the query…
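The idea of conditioning target-video representations on query-video features and then feeding them to per-frame heads can be illustrated with a toy example. The following is a minimal sketch, assuming a single scaled dot-product cross-attention step with a residual connection and two linear heads; it is not the authors' dual-stage alignment, and every name, shape, and weight here is hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attend(target, query):
    """Condition target frames (T, d) on query frames (Q, d).

    Assumed mechanism: one scaled dot-product attention step plus a
    residual connection. The paper's actual alignment stages may differ.
    """
    d = target.shape[-1]
    weights = softmax(target @ query.T / np.sqrt(d), axis=-1)  # (T, Q)
    return target + weights @ query


def predict_heads(conditioned, w_fg, w_bd):
    """Toy heads: per-frame foreground probability and (start, end) offsets."""
    fg_prob = 1.0 / (1.0 + np.exp(-(conditioned @ w_fg)))  # (T,)
    boundaries = conditioned @ w_bd                        # (T, 2)
    return fg_prob, boundaries


# Random stand-ins for frame features and head weights (illustrative only).
rng = np.random.default_rng(0)
target = rng.normal(size=(8, 16))   # 8 target frames, 16-dim features
query = rng.normal(size=(4, 16))    # 4 query frames
conditioned = cross_attend(target, query)
fg_prob, boundaries = predict_heads(
    conditioned, rng.normal(size=16), rng.normal(size=(16, 2))
)
```

In a sketch like this, frames with high foreground probability would be candidates for the retrieved moment, and the boundary outputs would refine where it starts and ends.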
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Video Analysis and Summarization
