Grounded Video Situation Recognition

Zeeshan Khan; C.V. Jawahar; Makarand Tapaswi

arXiv:2210.10828 · cs.CV · October 21, 2022 · 6 cites

TL;DR

This paper introduces VideoWhisperer, a three-stage Transformer model for grounded video situation recognition. It improves entity captioning and localizes verb-role pairs in video under weak supervision, without grounding annotations during training.

Contribution

The paper adds spatio-temporal grounding to the structured video situation recognition task in a weakly supervised setting, and integrates it into a multi-stage Transformer that makes joint predictions.

Findings

01 Significant improvement in entity captioning accuracy.

02 Effective localization of verb-roles without grounding annotations during training.

03 Operates on multiple events simultaneously for comprehensive understanding.

Abstract

Dense video understanding requires answering several questions such as who is doing what to whom, with what, how, why, and where. Video Situation Recognition (VidSitu) was recently framed as a task for structured prediction of multiple events, their relationships, and the actions and verb-role pairs attached to descriptive entities. This task poses several challenges in identifying, disambiguating, and co-referencing entities across multiple verb-role pairs, and also faces challenges in evaluation. In this work, we propose the addition of spatio-temporal grounding as an essential component of the structured prediction task in a weakly supervised setting, and present a novel three-stage Transformer model, VideoWhisperer, that is empowered to make joint predictions. In stage one, we learn contextualised embeddings for video features in parallel with key objects that appear in the…
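To make the structured output described in the abstract more concrete, the following is a minimal sketch of how a grounded situation could be represented: several events, one verb per event, and role-to-entity pairs that may carry a box. All class names, field names, role labels, and example values here are hypothetical and chosen for illustration; they are not taken from the paper or from the VidSitu data format.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class GroundedEntity:
    """An entity caption, optionally grounded to a box in one frame (hypothetical layout)."""
    caption: str
    frame_index: Optional[int] = None
    box: Optional[Tuple[float, float, float, float]] = None  # (x1, y1, x2, y2), assumed convention


@dataclass
class Event:
    """One event: a verb plus a mapping from role name to the entity filling it."""
    verb: str
    roles: Dict[str, GroundedEntity] = field(default_factory=dict)


@dataclass
class VideoSituation:
    """A video described as multiple events and pairwise relations between them."""
    events: List[Event] = field(default_factory=list)
    relations: Dict[Tuple[int, int], str] = field(default_factory=dict)

    def coreferent_roles(self, caption: str) -> List[Tuple[int, str]]:
        """Return (event index, role) pairs whose entity caption matches exactly.

        Exact string matching is a stand-in for co-reference, used only to
        keep the sketch small.
        """
        return [
            (i, role)
            for i, event in enumerate(self.events)
            for role, entity in event.roles.items()
            if entity.caption == caption
        ]


# Illustrative values only; not drawn from any dataset.
man = GroundedEntity("man in a red jacket", frame_index=3, box=(10.0, 20.0, 110.0, 220.0))
situation = VideoSituation(
    events=[
        Event("throw", {"agent": man, "patient": GroundedEntity("ball")}),
        Event("chase", {"agent": GroundedEntity("woman"), "patient": man}),
    ],
    relations={(0, 1): "causes"},
)

print(situation.coreferent_roles("man in a red jacket"))
```

The same entity fills different roles in the two events, which is the kind of cross-event co-reference the abstract lists as a challenge. In a weakly supervised setting the `box` field would be predicted at inference time rather than supplied as a training label.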

Videos

Grounded Video Situation Recognition · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Natural Language Processing Techniques

Methods: Attention Is All You Need · Linear Layer · Adam · Dense Connections · Softmax · Position-Wise Feed-Forward Layer · Multi-Head Attention · Label Smoothing · Absolute Position Encodings · Layer Normalization