TL;DR
The paper introduces SDGAN, a graph-based model for temporal video grounding (TVG) that combines static and dynamic features, query-aware alignment, and multi-granularity proposals to improve localization accuracy.
Contribution
According to the authors, SDGAN is the first TVG method to jointly exploit static and dynamic features, perform query-aware alignment, and incorporate multi-granularity proposals with progressive training.
Findings
SDGAN outperforms existing methods on three benchmark datasets.
Joint static and dynamic features enhance visual representation.
Query-clip contrastive learning improves query-aware localization.
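To make the query-clip contrastive finding concrete, here is a minimal sketch of one common way such an objective is written: an InfoNCE-style loss in which clips inside the ground-truth moment are positives for the query and all other clips are negatives. This summary does not give the paper's actual loss, so the formulation, the function names (`cosine`, `query_clip_contrastive_loss`), the `temperature` value, and the toy embeddings are all assumptions for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)

def query_clip_contrastive_loss(query, clips, inside, temperature=0.1):
    """Hypothetical InfoNCE-style loss (not the paper's stated formula).

    query:  query embedding (list of floats)
    clips:  list of clip embeddings, one per video clip
    inside: list of bools, True if the clip lies in the ground-truth moment
    """
    logits = [cosine(query, c) / temperature for c in clips]
    # Numerically stable log-sum-exp over all clips (positives and negatives).
    m = max(logits)
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    positives = [l for l, pos in zip(logits, inside) if pos]
    if not positives:
        raise ValueError("at least one clip must be marked as positive")
    # Average negative log-likelihood over the positive clips.
    return -sum(l - log_denom for l in positives) / len(positives)

# Toy example: the first two clips resemble the query, the last two do not.
query = [1.0, 0.0]
clips = [[1.0, 0.1], [0.9, 0.0], [0.0, 1.0], [-1.0, 0.2]]
aligned = query_clip_contrastive_loss(query, clips, [True, True, False, False])
misaligned = query_clip_contrastive_loss(query, clips, [False, False, True, True])
print(aligned, misaligned)
```

In this toy setup the loss is lower when the clips marked as positive are the ones whose embeddings resemble the query, which is the behavior a query-aware localization objective is meant to encourage.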
Abstract
Temporal Video Grounding (TVG) aims to localize temporal moments in an untrimmed video that semantically correspond to given natural language queries. Recently, Graph Convolutional Networks (GCNs) have been widely adopted in TVG to model temporal relations among video clips and enhance contextual reasoning by constructing clip-level graphs. Despite their effectiveness, existing GCN-based TVG methods encounter three critical bottlenecks: 1) most methods construct graph nodes using either static or dynamic features alone, resulting in incomplete visual representation and overlooking complementary semantics; 2) most methods construct temporal graphs in a query-agnostic manner, leading to inefficient feature interaction within the temporal graph representation; and 3) most methods suffer from single-granularity semantic matching, while direct training on complex temporal localization…
