STPro: Spatial and Temporal Progressive Learning for Weakly Supervised Spatio-Temporal Grounding

Aaryan Garg; Akash Kumar; Yogesh S Rawat

arXiv:2502.20678·cs.CV·April 8, 2025

TL;DR

This paper introduces STPro, a progressive learning framework that enhances weakly supervised spatio-temporal video grounding by incrementally improving action understanding and scene complexity adaptation, achieving state-of-the-art results.

Contribution

The paper proposes STPro, a novel framework with curriculum learning modules that significantly improve weakly supervised spatio-temporal grounding performance.

Findings

01. Achieves state-of-the-art results on three benchmarks.
02. Improves accuracy by 1.0% on VidSTG-Declarative.
03. Improves accuracy by 3.0% on HCSTVG-v1.

Abstract

In this work we study Weakly Supervised Spatio-Temporal Video Grounding (WSTVG), a challenging task of localizing subjects spatio-temporally in videos using only textual queries and no bounding box supervision. Inspired by recent advances in vision-language foundation models, we investigate their utility for WSTVG, leveraging their zero-shot grounding capabilities. However, we find that a simple adaptation lacks essential spatio-temporal grounding abilities. To bridge this gap, we introduce Tubelet Referral Grounding (TRG), which connects textual queries to tubelets to enable spatio-temporal predictions. Despite its promise, TRG struggles with compositional action understanding and dense scene scenarios. To address these limitations, we propose STPro, a novel progressive learning framework with two key modules: (1) Sub-Action Temporal Curriculum Learning (SA-TCL), which incrementally…

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Generative Adversarial Networks and Image Synthesis