Contextual Self-paced Learning for Weakly Supervised Spatio-Temporal Video Grounding
Akash Kumar, Zsolt Kira, Yogesh Singh Rawat

TL;DR
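This paper introduces CoSPaL, a self-paced learning framework for weakly supervised spatio-temporal video grounding. It combines spatio-temporal prediction, contextual understanding, and progressive training to address three limitations observed when adapting existing detection models to the task: inconsistent temporal predictions, inadequate understanding of complex queries, and difficulty adapting to hard scenarios.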
Contribution
It proposes CoSPaL, a new approach combining tubelet phrase grounding, contextual referral, and self-paced training to improve weakly supervised video grounding performance.
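As a rough illustration of the self-paced training idea named above, the following is a minimal sketch of the classic hard-threshold form of self-paced learning, in which samples are admitted to training from easy to hard as a loss threshold grows. It is not CoSPaL's actual curriculum, which this summary does not specify. The function names, the per-sample losses, and the threshold schedule are all assumptions made for the example, and the losses are held fixed across stages for simplicity, whereas in practice they would be recomputed as the model trains.

```python
def self_paced_weights(losses, threshold):
    """Hard self-paced weighting: weight 1 for samples whose loss is below the threshold, else 0."""
    return [1.0 if loss < threshold else 0.0 for loss in losses]


def self_paced_schedule(losses, start_threshold, growth, num_stages):
    """Return, for each stage, the indices of samples admitted to training.

    The threshold is multiplied by `growth` after every stage, so harder
    (higher-loss) samples are included progressively.
    """
    stages = []
    threshold = start_threshold
    for _ in range(num_stages):
        weights = self_paced_weights(losses, threshold)
        stages.append([i for i, w in enumerate(weights) if w > 0])
        threshold *= growth
    return stages


# Hypothetical per-sample losses; a lower loss means an "easier" sample.
losses = [0.2, 1.5, 0.7, 3.0]
stages = self_paced_schedule(losses, start_threshold=0.5, growth=2.0, num_stages=4)
for stage_index, admitted in enumerate(stages):
    print(stage_index, admitted)
```

With these illustrative values, only the easiest sample is used in the first stage and all four are used by the last one.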
Findings
Enhanced temporal prediction accuracy
Improved understanding of complex queries
Better adaptation to difficult scenarios
Abstract
In this work, we focus on Weakly Supervised Spatio-Temporal Video Grounding (WSTVG). It is a multimodal task aimed at localizing specific subjects spatio-temporally based on textual queries without bounding box supervision. Motivated by recent advancements in multimodal foundation models for grounding tasks, we first explore the potential of state-of-the-art object detection models for WSTVG. Despite their robust zero-shot capabilities, our adaptation reveals significant limitations, including inconsistent temporal predictions, inadequate understanding of complex queries, and challenges in adapting to difficult scenarios. We propose CoSPaL (Contextual Self-Paced Learning), a novel approach designed to overcome these limitations. CoSPaL integrates three core components: (1) Tubelet Phrase Grounding (TPG), which introduces spatio-temporal prediction by linking textual queries to…
Taxonomy
Topics: Video Analysis and Summarization · Human Pose and Action Recognition · Multimodal Machine Learning Applications
