TL;DR
This paper introduces a multi-scale contrastive learning framework for video temporal grounding that captures salient semantics across video moments of different lengths without requiring data augmentation or online memory banks.
Contribution
It proposes a contrastive learning approach that draws its samples from multi-stage video encoder features to improve temporal grounding accuracy across video moments of varying length.
Findings
Enhanced performance on long-form video grounding tasks.
Effective linking of local and global video moments.
No need for data augmentation or online memory banks.
Abstract
Temporal grounding, which localizes video moments related to a natural language query, is a core problem of vision-language learning and video understanding. To encode video moments of varying lengths, recent methods employ a multi-level structure known as a feature pyramid. In this structure, lower levels concentrate on short-range video moments, while higher levels address long-range moments. Because higher levels are downsampled to accommodate increasing moment length, their capacity to capture information is reduced, which degrades the resulting moment representations. To resolve this problem, we propose a contrastive learning framework to capture salient semantics among video moments. Our key methodology is to leverage samples from the feature space emanating from multiple stages of the video encoder itself, requiring neither data augmentation nor online…
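To make the feature-pyramid idea in the abstract concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes stride-2 average pooling between pyramid levels and an InfoNCE-style loss in which each higher-level moment treats the lower-level moments it temporally covers as positives and all other lower-level moments as negatives. The function names (`build_pyramid`, `cross_level_info_nce`), the positive-pair rule, and the temperature value are all hypothetical choices made for illustration; the paper may define its pairs and objective differently.

```python
import numpy as np


def build_pyramid(clip_feats, num_levels):
    """Build a temporal feature pyramid (assumed: stride-2 average pooling).

    clip_feats: array of shape (T, D), one feature vector per clip.
    Returns a list of arrays; level 0 is the input, each next level
    has half the temporal length of the previous one.
    """
    levels = [clip_feats]
    for _ in range(num_levels - 1):
        x = levels[-1]
        t = (x.shape[0] // 2) * 2  # drop a trailing step if the length is odd
        levels.append(x[:t].reshape(t // 2, 2, x.shape[1]).mean(axis=1))
    return levels


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def cross_level_info_nce(low, high, stride, temperature=0.1):
    """InfoNCE-style loss between two pyramid levels (illustrative assumption).

    Anchor: each high-level moment i.
    Positives: low-level moments j with j // stride == i (temporal overlap).
    Negatives: every other low-level moment of the same video.
    """
    sim = l2_normalize(high) @ l2_normalize(low).T / temperature  # (T_high, T_low)
    sim = sim - sim.max(axis=1, keepdims=True)  # numerical stability
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))

    owner = np.arange(low.shape[0]) // stride  # high-level index covering each low step
    mask = owner[None, :] == np.arange(high.shape[0])[:, None]

    # Mean negative log-probability of the positives, averaged over anchors.
    per_anchor = -(log_prob * mask).sum(axis=1) / mask.sum(axis=1)
    return float(per_anchor.mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(16, 8))       # 16 clips, 8-dim features (toy data)
    levels = build_pyramid(feats, num_levels=3)
    print([lvl.shape for lvl in levels])   # temporal lengths 16, 8, 4
    print(cross_level_info_nce(levels[0], levels[2], stride=4))
```

Because both positives and negatives come from the pyramid levels of the same video, this sketch needs no augmented views and no memory bank, which mirrors the property the abstract highlights. In a real model the loss would be computed on learned encoder outputs and backpropagated, which plain NumPy does not do.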
