GranAlign: Granularity-Aware Alignment Framework for Zero-Shot Video Moment Retrieval
Mingyu Jeon, Sunjae Yoon, Jonghee Kim, Junyeoung Kim

TL;DR
GranAlign is a zero-shot video moment retrieval framework that addresses semantic granularity mismatch between text and video by using query rewriting and caption generation, achieving state-of-the-art results without task-specific training.
Contribution
The paper introduces a training-free, granularity-aware alignment framework that improves zero-shot video moment retrieval by bridging semantic gaps between modalities.
Findings
Sets new state-of-the-art on three benchmarks.
Achieves 3.23% mAP@avg improvement on QVHighlights.
Bridges the semantic granularity gap between text queries and video content in zero-shot retrieval.
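For context on the mAP figure above: moment retrieval benchmarks conventionally score a predicted moment by its temporal intersection-over-union (IoU) with the ground-truth moment, and mAP is then computed over IoU thresholds. The following is a minimal sketch of that overlap measure only, not code from the paper; the function name `temporal_iou` and the example spans are illustrative assumptions.

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) spans, both in seconds.

    Illustrative helper (hypothetical name), not part of GranAlign.
    """
    (pred_start, pred_end), (gt_start, gt_end) = pred, gt
    # Overlap is the shared duration, clamped at zero for disjoint spans.
    inter = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    # Union is the total covered duration, counting the overlap once.
    union = (pred_end - pred_start) + (gt_end - gt_start) - inter
    return inter / union if union > 0 else 0.0


# Made-up spans: a 10 s prediction that overlaps a 10 s ground truth by 5 s.
score = temporal_iou((10.0, 20.0), (15.0, 25.0))
print(f"tIoU = {score:.3f}")
```

A prediction counts as correct at a given threshold (for example 0.5) when its temporal IoU with the ground truth meets or exceeds that threshold.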
Abstract
Zero-shot video moment retrieval (ZVMR) is the task of localizing a temporal moment within an untrimmed video using a natural language query without relying on task-specific training data. The primary challenge in this setting lies in the mismatch in semantic granularity between textual queries and visual content. Previous studies in ZVMR have attempted to achieve alignment by leveraging high-quality pre-trained knowledge that represents video and language in a joint space. However, these approaches failed to balance the semantic granularity between the pre-trained knowledge provided by each modality for a given scene. As a result, despite the high quality of each modality's representations, the mismatch in granularity led to inaccurate retrieval. In this paper, we propose a training-free framework, called Granularity-Aware Alignment (GranAlign), that bridges this gap between coarse and…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization
