Aligning Moments in Time using Video Queries

Yogesh Kumar; Uday Agarwal; Manish Gupta; Anand Mishra

arXiv:2508.15439·cs.CV·September 3, 2025

TL;DR

This paper introduces MATR, a transformer-based model for video-to-video moment retrieval that aligns semantic and temporal features to localize unseen events in a target video, with reported R@1 gains of 13.1% on ActivityNet-VRL and 14.7% on SportsMoments.

Contribution

The paper presents MATR, a transformer model that combines dual-stage sequence alignment with a self-supervised pre-training technique to improve moment localization in videos.

Findings

01. Achieves a 13.1% improvement in R@1 on ActivityNet-VRL.

02. Shows a 14.7% gain in R@1 on the SportsMoments dataset.

03. Outperforms state-of-the-art methods in video moment retrieval.
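
For readers unfamiliar with the metric: R@1 in moment retrieval is usually computed as the fraction of queries whose top-ranked predicted segment overlaps the ground-truth segment with a temporal IoU at or above some threshold. The sketch below illustrates that convention only. The function names and the 0.5 default threshold are assumptions for illustration and are not taken from the paper, which may report R@1 at other thresholds.

```python
def temporal_iou(pred, gt):
    """Temporal IoU of two (start, end) segments, e.g. in seconds."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0


def recall_at_1(top1_preds, gts, iou_threshold=0.5):
    """Fraction of queries whose top-1 segment reaches the IoU threshold.

    iou_threshold=0.5 is an assumed default, not a value from the paper.
    """
    if len(top1_preds) != len(gts) or not gts:
        raise ValueError("need one top-1 prediction per non-empty ground truth list")
    hits = sum(
        temporal_iou(pred, gt) >= iou_threshold
        for pred, gt in zip(top1_preds, gts)
    )
    return hits / len(gts)


if __name__ == "__main__":
    predictions = [(0.0, 10.0), (2.0, 8.0)]
    ground_truth = [(1.0, 10.0), (20.0, 30.0)]
    print(recall_at_1(predictions, ground_truth))
```

Under this convention a gain in R@1 means more queries had their single best prediction land on the correct moment.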

Abstract

Video-to-video moment retrieval (Vid2VidMR) is the task of localizing unseen events or moments in a target video using a query video. This task poses several challenges, such as the need for semantic frame-level alignment and modeling complex dependencies between query and target videos. To tackle this challenging problem, we introduce MATR (Moment Alignment TRansformer), a transformer-based model designed to capture semantic context as well as the temporal details necessary for precise moment localization. MATR conditions target video representations on query video features using dual-stage sequence alignment that encodes the required correlations and dependencies. These representations are then used to guide foreground/background classification and boundary prediction heads, enabling the model to accurately identify moments in the target video that semantically match with the query…
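
The abstract does not say how the outputs of the foreground/background classification and boundary prediction heads are combined into a final moment. One common decoding scheme, shown here purely as an assumed illustration and not as MATR's actual procedure, picks the target frame with the highest foreground probability and expands it by that frame's predicted distances to the start and end boundaries. All names (`decode_moment`, `fg_probs`, `start_offsets`, `end_offsets`) are hypothetical.

```python
def decode_moment(fg_probs, start_offsets, end_offsets, fps=1.0):
    """Turn per-frame head outputs into one (start_sec, end_sec, score) moment.

    Hypothetical inputs, one value per target frame:
      fg_probs      - foreground probability of the frame
      start_offsets - predicted distance (in frames) back to the moment start
      end_offsets   - predicted distance (in frames) forward to the moment end
    This decoding rule is an assumption, not the paper's stated method.
    """
    n = len(fg_probs)
    if n == 0 or len(start_offsets) != n or len(end_offsets) != n:
        raise ValueError("all inputs must be non-empty and of equal length")

    # Anchor on the most confident foreground frame (first one on ties).
    best = max(range(n), key=lambda i: fg_probs[i])

    # Expand the anchor by its predicted offsets, clamped to the video extent.
    start = max(0.0, best - start_offsets[best])
    end = min(float(n), best + end_offsets[best])

    # Convert frame indices to seconds.
    return start / fps, end / fps, fg_probs[best]


if __name__ == "__main__":
    fg = [0.1, 0.2, 0.9, 0.8, 0.1]
    to_start = [0.0, 0.0, 1.0, 2.0, 0.0]
    to_end = [0.0, 0.0, 2.0, 1.0, 0.0]
    print(decode_moment(fg, to_start, to_end))
```

A fuller pipeline would typically decode a candidate from every confident frame and rank them, but the single-anchor version is enough to show how the two heads could cooperate.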

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Video Analysis and Summarization