Multi-Scale Contrastive Learning for Video Temporal Grounding

Thong Thanh Nguyen; Yi Bin; Xiaobao Wu; Zhiyuan Hu; Cong-Duy T Nguyen; See-Kiong Ng; Anh Tuan Luu

arXiv:2412.07157·cs.CV·April 28, 2026

TL;DR

This paper introduces a multi-scale contrastive learning framework for video temporal grounding that effectively captures salient semantics across different video lengths without requiring data augmentation.

Contribution

It proposes a novel contrastive learning approach leveraging multi-stage video encoder features to improve temporal grounding accuracy across various video lengths.

Findings

01. Enhanced performance on long-form video grounding tasks.

02. Effective linking of local and global video moments.

03. No need for data augmentation or online memory banks.

Abstract

Temporal grounding, which localizes video moments related to a natural language query, is a core problem of vision-language learning and video understanding. To encode video moments of varying lengths, recent methods employ a multi-level structure known as a feature pyramid. In this structure, lower levels concentrate on short-range video moments, while higher levels address long-range moments. Because higher levels are downsampled to accommodate increasing moment length, their capacity to capture information is reduced, which degrades the moment representations. To resolve this problem, we propose a contrastive learning framework to capture salient semantics among video moments. Our key methodology is to leverage samples from the feature space emanating from multiple stages of the video encoder itself, requiring neither data augmentation nor online…
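The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It builds a toy feature pyramid by average-pooling adjacent clip features, then applies an InfoNCE-style contrastive loss between neighbouring pyramid levels. The function names (`build_pyramid`, `info_nce`, `cross_scale_loss`), the pooling choice, the temperature value, and the rule that a higher-level moment is paired with the lower-level moments it temporally covers are all assumptions made for illustration. The abstract is truncated before it describes the actual sampling strategy.

```python
import math


def build_pyramid(features, num_levels):
    """Toy feature pyramid: level 0 is the input sequence of clip features;
    each higher level halves the temporal length by averaging adjacent pairs.
    A trailing unpaired clip is dropped (an illustrative simplification)."""
    levels = [features]
    for _ in range(num_levels - 1):
        prev = levels[-1]
        if len(prev) < 2:
            break
        pooled = [
            [(a + b) / 2.0 for a, b in zip(prev[i], prev[i + 1])]
            for i in range(0, len(prev) - 1, 2)
        ]
        levels.append(pooled)
    return levels


def cosine(u, v):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss for one anchor: -log softmax of the positive logit."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    m = max(logits)  # subtract the max for numerical stability
    log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denom - logits[0]


def cross_scale_loss(levels, temperature=0.1):
    """Average InfoNCE loss between neighbouring pyramid levels.
    ASSUMPTION: positives for upper-level moment j are the two lower-level
    moments it was pooled from; all other lower-level moments are negatives."""
    total, count = 0.0, 0
    for lower, upper in zip(levels, levels[1:]):
        for j, anchor in enumerate(upper):
            covered = (2 * j, 2 * j + 1)
            negatives = [f for i, f in enumerate(lower) if i not in covered]
            if not negatives:
                continue
            for i in covered:
                total += info_nce(anchor, lower[i], negatives, temperature)
                count += 1
    return total / count if count else 0.0


# Hypothetical input: 8 clips with 2-dimensional features.
features = [[float(i), 1.0] for i in range(8)]
levels = build_pyramid(features, num_levels=3)
loss = cross_scale_loss(levels)
```

In this sketch both positives and negatives come from the pyramid's own levels, so no augmented views or memory bank are involved, which mirrors the property the abstract highlights. A real model would use learned encoder stages and query-conditioned sampling rather than fixed average pooling.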

Code & Models

Repositories

nguyentthong/MSCL
github

Videos

Multi-Scale Contrastive Learning for Video Temporal Grounding · Underline