Contextual Self-paced Learning for Weakly Supervised Spatio-Temporal Video Grounding

Akash Kumar; Zsolt Kira; Yogesh Singh Rawat

arXiv:2501.17053 · cs.CV · March 18, 2025

TL;DR

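This paper introduces CoSPaL (Contextual Self-Paced Learning), a self-paced learning framework for weakly supervised spatio-temporal video grounding. It combines spatio-temporal prediction, contextual understanding, and progressive training to address three weaknesses the authors observe when adapting existing detection models to the task: inconsistent temporal predictions, inadequate understanding of complex queries, and difficulty adapting to hard scenarios.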

Contribution

The paper proposes CoSPaL, an approach that combines Tubelet Phrase Grounding (TPG), contextual referral, and self-paced training to improve video grounding performance under weak supervision, that is, without bounding box annotations.
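As a rough illustration of the self-paced training idea mentioned above, the sketch below shows the generic self-paced learning pattern: train first on samples the model currently finds easy (low loss), then admit harder ones as a pace threshold grows. This is not the paper's implementation. The function names `pace_schedule` and `self_paced_select`, the geometric schedule, and the toy loss values are all assumptions made for illustration.

```python
from typing import List, Sequence


def pace_schedule(start: float, growth: float, epochs: int) -> List[float]:
    """Hypothetical schedule: the loss threshold grows geometrically per epoch."""
    return [start * growth ** epoch for epoch in range(epochs)]


def self_paced_select(losses: Sequence[float], threshold: float) -> List[int]:
    """Return indices of samples whose current loss is below the threshold.

    In generic self-paced learning, only these "easy" samples contribute
    to the update at the current stage.
    """
    return [i for i, loss in enumerate(losses) if loss < threshold]


# Toy per-sample losses (made-up numbers, not results from the paper).
losses = [0.2, 1.5, 0.7, 3.0]
thresholds = pace_schedule(start=0.5, growth=2.0, epochs=3)

# For each stage, the set of samples admitted to training.
selected = [self_paced_select(losses, t) for t in thresholds]
for t, idx in zip(thresholds, selected):
    print(f"threshold={t}: training on samples {idx}")
```

With these toy numbers the admitted set grows from one sample to three as the threshold rises, while the hardest sample stays excluded throughout. How CoSPaL actually measures difficulty and schedules its stages is described in the paper itself.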

Findings

01 Enhanced temporal prediction accuracy

02 Improved understanding of complex queries

03 Better adaptation to difficult scenarios

Abstract

In this work, we focus on Weakly Supervised Spatio-Temporal Video Grounding (WSTVG), a multimodal task aimed at localizing specific subjects spatio-temporally based on textual queries without bounding box supervision. Motivated by recent advancements in multimodal foundation models for grounding tasks, we first explore the potential of state-of-the-art object detection models for WSTVG. Despite their robust zero-shot capabilities, our adaptation reveals significant limitations, including inconsistent temporal predictions, inadequate understanding of complex queries, and challenges in adapting to difficult scenarios. We propose CoSPaL (Contextual Self-Paced Learning), a novel approach designed to overcome these limitations. CoSPaL integrates three core components: (1) Tubelet Phrase Grounding (TPG), which introduces spatio-temporal prediction by linking textual queries to…

Videos

Contextual Self-paced Learning for Weakly Supervised Spatio-Temporal Video Grounding · SlidesLive

Taxonomy

Topics: Video Analysis and Summarization · Human Pose and Action Recognition · Multimodal Machine Learning Applications
