CVA: Context-aware Video-text Alignment for Video Temporal Grounding
Sungho Moon, Seunghun Lee, Jiwan Seo, Sunghoon Im

TL;DR
CVA improves video-text alignment for video temporal grounding by combining a query-aware data augmentation strategy, a contrastive boundary loss, and a hierarchical transformer architecture, and reports state-of-the-art results on QVHighlights and Charades-STA.
Contribution
The paper introduces three components (QCD, CBD loss, and CTE) that together improve the robustness and accuracy of video-text alignment for temporal grounding.
Findings
Achieves an improvement of approximately 5 points in Recall@1.
Outperforms existing methods on QVHighlights and Charades-STA benchmarks.
Demonstrates robustness to irrelevant background context.
Abstract
We propose Context-aware Video-text Alignment (CVA), a novel framework to address a significant challenge in video temporal grounding: achieving temporally sensitive video-text alignment that remains robust to irrelevant background context. Our framework is built on three key components. First, we propose Query-aware Context Diversification (QCD), a new data augmentation strategy that ensures only semantically unrelated content is mixed in. It builds a video-text similarity-based pool of replacement clips to simulate diverse contexts while preventing the "false negative" caused by query-agnostic mixing. Second, we introduce the Context-invariant Boundary Discrimination (CBD) loss, a contrastive loss that enforces semantic consistency at challenging temporal boundaries, making their representations robust to contextual shifts and hard negatives. Third, we introduce the Context-enhanced…
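The QCD idea described in the abstract (a similarity-based pool of replacement clips, with only query-unrelated content mixed in) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of cosine similarity, the fixed threshold, and the uniform sampling are all assumptions made for illustration, since the abstract does not give these details.

```python
import numpy as np


def build_replacement_pool(clip_feats, query_feat, threshold=0.2):
    """Hypothetical helper: select candidate clips unrelated to the query.

    clip_feats: (N, D) array of candidate clip features.
    query_feat: (D,) array, the text query feature.
    threshold:  assumed cutoff; clips with cosine similarity below it
                are treated as semantically unrelated to the query.
    Returns the indices of the pooled clips and all similarities.
    """
    clips = clip_feats / np.linalg.norm(clip_feats, axis=1, keepdims=True)
    query = query_feat / np.linalg.norm(query_feat)
    sims = clips @ query  # cosine similarity of each clip to the query
    return np.flatnonzero(sims < threshold), sims


def diversify_context(video_feats, gt_mask, pool_feats, rng):
    """Hypothetical helper: swap background clips for pooled clips.

    video_feats: (T, D) array of clip features for one video.
    gt_mask:     (T,) boolean array, True inside the ground-truth moment.
    pool_feats:  (P, D) array of query-unrelated replacement clips.
    rng:         a numpy.random.Generator used for sampling.
    Ground-truth clips are left untouched; the input is not modified.
    """
    out = video_feats.copy()
    background = np.flatnonzero(~gt_mask)
    if len(pool_feats) == 0 or len(background) == 0:
        return out  # nothing to replace, or nothing to replace with
    picks = rng.integers(0, len(pool_feats), size=len(background))
    out[background] = pool_feats[picks]
    return out
```

Filtering the pool by similarity to the query is what distinguishes this from query-agnostic mixing: a clip that happens to match the query is never inserted as background, which is the "false negative" case the abstract refers to.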
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Generative Adversarial Networks and Image Synthesis
