CVA: Context-aware Video-text Alignment for Video Temporal Grounding

Sungho Moon; Seunghun Lee; Jiwan Seo; Sunghoon Im

arXiv:2603.24934·cs.LG·March 27, 2026

Open Access

TL;DR

CVA is a framework that improves video-text alignment for temporal grounding by combining a query-aware data augmentation strategy, a boundary-focused contrastive loss, and a hierarchical transformer architecture, achieving state-of-the-art results on QVHighlights and Charades-STA.

Contribution

The paper introduces three components (QCD, the CBD loss, and CTE) that together improve the robustness and accuracy of video-text temporal grounding.

Findings

01

Achieves an improvement of approximately 5 points in Recall@1.

02

Outperforms existing methods on QVHighlights and Charades-STA benchmarks.

03

Demonstrates robustness to irrelevant background context.

Abstract

We propose Context-aware Video-text Alignment (CVA), a novel framework to address a significant challenge in video temporal grounding: achieving temporally sensitive video-text alignment that remains robust to irrelevant background context. Our framework is built on three key components. First, we propose Query-aware Context Diversification (QCD), a new data augmentation strategy that ensures only semantically unrelated content is mixed in. It builds a video-text similarity-based pool of replacement clips to simulate diverse contexts while preventing the "false negative" caused by query-agnostic mixing. Second, we introduce the Context-invariant Boundary Discrimination (CBD) loss, a contrastive loss that enforces semantic consistency at challenging temporal boundaries, making their representations robust to contextual shifts and hard negatives. Third, we introduce the Context-enhanced…
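The QCD idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the cosine-similarity filter, the `max_sim` threshold, the `replace_prob` parameter, and all function names are assumptions made for illustration. The sketch keeps only candidate clips that are dissimilar to the query, then swaps them into background positions outside the ground-truth moment, so that no replacement clip accidentally matches the query.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def build_replacement_pool(query_emb, candidate_clips, max_sim=0.2):
    """Keep candidates whose similarity to the query is below max_sim.

    max_sim is a hypothetical threshold; the paper's actual selection
    rule is not given in the abstract.
    """
    return [c for c in candidate_clips if cosine(query_emb, c) < max_sim]


def diversify_context(video_clips, moment, pool, replace_prob=0.5, rng=None):
    """Replace background clips (indices outside [start, end)) with pool clips.

    Clips inside the ground-truth moment are never modified.
    """
    rng = rng or random.Random(0)
    start, end = moment
    out = list(video_clips)
    if not pool:
        return out  # nothing safe to mix in, so leave the video unchanged
    for i in range(len(out)):
        if not (start <= i < end) and rng.random() < replace_prob:
            out[i] = rng.choice(pool)
    return out


# Toy usage with 2-D "embeddings" standing in for clip and query features.
query = [1.0, 0.0]
candidates = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
pool = build_replacement_pool(query, candidates)
video = [[0.5, 0.5]] * 6
augmented = diversify_context(video, moment=(2, 4), pool=pool, replace_prob=1.0)
```

Filtering by similarity to the query, rather than sampling replacement clips at random, is what separates this from query-agnostic mixing: a randomly chosen clip could itself match the query and become an unlabeled positive.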

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Generative Adversarial Networks and Image Synthesis