STPro: Spatial and Temporal Progressive Learning for Weakly Supervised Spatio-Temporal Grounding
Aaryan Garg, Akash Kumar, Yogesh S Rawat

TL;DR
This paper introduces STPro, a progressive learning framework that enhances weakly supervised spatio-temporal video grounding by incrementally improving action understanding and scene complexity adaptation, achieving state-of-the-art results.
Contribution
The paper proposes STPro, a novel framework with curriculum learning modules that significantly improve weakly supervised spatio-temporal grounding performance.
Findings
Achieves state-of-the-art results on three benchmarks.
Improves accuracy by 1.0% on VidSTG-Declarative.
Improves accuracy by 3.0% on HCSTVG-v1.
Abstract
In this work we study Weakly Supervised Spatio-Temporal Video Grounding (WSTVG), a challenging task of localizing subjects spatio-temporally in videos using only textual queries and no bounding box supervision. Inspired by recent advances in vision-language foundation models, we investigate their utility for WSTVG, leveraging their zero-shot grounding capabilities. However, we find that a simple adaptation lacks essential spatio-temporal grounding abilities. To bridge this gap, we introduce Tubelet Referral Grounding (TRG), which connects textual queries to tubelets to enable spatio-temporal predictions. Despite its promise, TRG struggles with compositional action understanding and dense scene scenarios. To address these limitations, we propose STPro, a novel progressive learning framework with two key modules: (1) Sub-Action Temporal Curriculum Learning (SA-TCL), which incrementally…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Human Pose and Action Recognition · Generative Adversarial Networks and Image Synthesis
