What, when, and where? -- Self-Supervised Spatio-Temporal Grounding in Untrimmed Multi-Action Videos from Narrated Instructions

Brian Chen; Nina Shvetsova; Andrew Rouditchenko; Daniel Kondermann; Samuel Thomas; Shih-Fu Chang; Rogerio Feris; James Glass; Hilde Kuehne

arXiv:2303.16990 · cs.CV · May 30, 2024 · 1 citation

TL;DR

This paper introduces a self-supervised framework for spatio-temporal grounding in untrimmed videos using only video and subtitle data, and presents a new dataset for evaluation.

Contribution

It proposes a novel multimodal self-supervised approach combining local and global representations for grounding without human annotations.

Findings

1. Improved grounding accuracy over baselines
2. Effective in spatial, temporal, and multi-action scenarios
3. New benchmark dataset with dense annotations

Abstract

Spatio-temporal grounding describes the task of localizing events in space and time, e.g., in video data, based on verbal descriptions only. Models for this task are usually trained with human-annotated sentences and bounding box supervision. This work addresses this task from a multimodal supervision perspective, proposing a framework for spatio-temporal action grounding trained on loose video and subtitle supervision only, without human annotation. To this end, we combine local representation learning, which leverages fine-grained spatial information, with a global representation encoding that captures higher-level representations, and incorporate both in a joint approach. To evaluate this challenging task in a real-life setting, a new benchmark dataset is proposed providing dense spatio-temporal grounding annotations in long, untrimmed, multi-action instructional videos…

Code & Models

Repositories

brian7685/STG
Official

Datasets

CVML-TueAI/grounding-YT-dataset
dataset · 43 downloads

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Video Analysis and Summarization