Variational Cross-Graph Reasoning and Adaptive Structured Semantics Learning for Compositional Temporal Grounding
Juncheng Li, Siliang Tang, Linchao Zhu, Wenqiao Zhang, Yi Yang, Tat-Seng Chua, Fei Wu, Yueting Zhuang

TL;DR
This paper introduces a new benchmark and a novel variational cross-graph reasoning framework for temporal grounding, emphasizing structured semantics to improve compositional generalization in video-language understanding.
Contribution
It proposes a variational cross-graph reasoning model with adaptive structured semantics learning to enhance compositional generalization in temporal grounding tasks.
Findings
State-of-the-art methods perform poorly on new compositional queries.
The proposed approach significantly improves generalization to novel word combinations.
Structured semantic graphs are crucial for compositional reasoning in videos.
Abstract
Temporal grounding is the task of locating a specific segment from an untrimmed video according to a query sentence. This task has achieved significant momentum in the computer vision community as it enables activity grounding beyond pre-defined activity classes by utilizing the semantic diversity of natural language descriptions. The semantic diversity is rooted in the principle of compositionality in linguistics, where novel semantics can be systematically described by combining known words in novel ways (compositional generalization). However, existing temporal grounding datasets are not carefully designed to evaluate the compositional generalizability. To systematically benchmark the compositional generalizability of temporal grounding models, we introduce a new Compositional Temporal Grounding task and construct two new dataset splits, i.e., Charades-CG and ActivityNet-CG. When…
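The notion of compositional generalization described above (known words combined in novel ways) can be illustrated with a small sketch. This is not the procedure used to build Charades-CG or ActivityNet-CG, which this excerpt does not describe. The helper names `tokenize` and `is_novel_composition`, the whitespace tokenizer, and the pairwise co-occurrence criterion are all assumptions made for illustration.

```python
from itertools import combinations


def tokenize(query):
    # Hypothetical tokenizer: lowercase, drop periods, split on whitespace.
    return query.lower().replace(".", "").split()


def is_novel_composition(query, train_queries):
    """Return True if every word in `query` appears somewhere in training,
    but at least one pair of its words never co-occurs in a single
    training query (an assumed, simplified notion of "novel composition").
    """
    seen_words = set()
    seen_pairs = set()
    for train_query in train_queries:
        tokens = sorted(set(tokenize(train_query)))
        seen_words.update(tokens)
        seen_pairs.update(combinations(tokens, 2))

    tokens = sorted(set(tokenize(query)))
    if not set(tokens) <= seen_words:
        # Contains an unseen word, so it is not a pure recombination.
        return False
    return any(pair not in seen_pairs for pair in combinations(tokens, 2))


train = ["person opens the door", "person closes the laptop"]
print(is_novel_composition("person opens the laptop", train))
print(is_novel_composition("person opens the door", train))
```

Under these assumptions, "person opens the laptop" counts as a novel composition because "opens" and "laptop" were each seen in training but never in the same query, whereas "person opens the door" does not.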
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Natural Language Processing Techniques
