Compositional Temporal Grounding with Structured Variational Cross-Graph Correspondence Learning
Juncheng Li, Junlin Xie, Long Qian, Linchao Zhu, Siliang Tang, Fei Wu, Yi Yang, Yueting Zhuang, Xin Eric Wang

TL;DR
This paper introduces a new task and datasets for evaluating compositional generalization in temporal video grounding, revealing current models' limitations and proposing a structured variational reasoning framework that improves generalization.
Contribution
The paper presents a novel compositional temporal grounding task, new datasets Charades-CG and ActivityNet-CG, and a structured variational cross-graph reasoning model that enhances compositional generalization.
Findings
State-of-the-art temporal grounding methods fail to generalize to queries that combine known words in novel ways.
The proposed model outperforms baselines on compositional generalization.
New datasets enable systematic evaluation of compositional generalization.
Abstract
Temporal grounding in videos aims to localize one target video segment that semantically corresponds to a given query sentence. Thanks to the semantic diversity of natural language descriptions, temporal grounding allows activity grounding beyond pre-defined classes and has received increasing attention in recent years. The semantic diversity is rooted in the principle of compositionality in linguistics, where novel semantics can be systematically described by combining known words in novel ways (compositional generalization). However, current temporal grounding datasets do not specifically test for compositional generalizability. To systematically measure the compositional generalizability of temporal grounding models, we introduce a new Compositional Temporal Grounding task and construct two new dataset splits, i.e., Charades-CG and ActivityNet-CG. Evaluating the state-of-the-art…
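To make "combining known words in novel ways" concrete, here is a minimal sketch of how a test query might be checked against a set of training queries. It is an illustration under simplifying assumptions, not the procedure used to build Charades-CG or ActivityNet-CG: the whitespace tokenizer, the use of unordered word pairs as "compositions", and the function names and labels (`tokenize`, `build_vocab_and_pairs`, `classify_query`, `"novel-composition"`, `"novel-word"`, `"seen"`) are all hypothetical.

```python
from itertools import combinations


def tokenize(query):
    """Lowercase whitespace tokenization (a simplifying assumption)."""
    return query.lower().replace(".", "").split()


def build_vocab_and_pairs(train_queries):
    """Collect every word and every unordered word pair seen in training queries."""
    words, pairs = set(), set()
    for query in train_queries:
        tokens = sorted(set(tokenize(query)))
        words.update(tokens)
        # Sorting first makes each pair a canonical (a, b) tuple with a < b.
        pairs.update(combinations(tokens, 2))
    return words, pairs


def classify_query(query, words, pairs):
    """Label a query relative to the training set (hypothetical labels)."""
    tokens = sorted(set(tokenize(query)))
    if any(token not in words for token in tokens):
        return "novel-word"          # contains a word never seen in training
    if any(pair not in pairs for pair in combinations(tokens, 2)):
        return "novel-composition"   # all words seen, but some pairing is new
    return "seen"


train = ["A person opens the door.", "A person closes the window."]
words, pairs = build_vocab_and_pairs(train)
print(classify_query("A person opens the window.", words, pairs))
print(classify_query("A person opens the door.", words, pairs))
print(classify_query("A person eats the sandwich.", words, pairs))
```

In this toy example, "opens" and "window" each appear in training but never together, so the first query counts as a novel composition under the assumed pair-based definition. A realistic split would more likely rely on syntactic roles (for example verb-noun pairs) than on raw word co-occurrence.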
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Human Pose and Action Recognition
