Grounded Video Situation Recognition
Zeeshan Khan, C.V. Jawahar, Makarand Tapaswi

TL;DR
This paper introduces VideoWhisperer, a three-stage Transformer model for grounded video situation recognition that improves entity captioning and localizes verb-roles in videos under weak supervision.
Contribution
It adds spatio-temporal grounding to the structured video situation recognition task and integrates it into a multi-stage Transformer trained under weak supervision.
Findings
Significant improvement in entity captioning accuracy.
Effective localization of verb-roles without grounding annotations during training.
Operates on multiple events simultaneously for comprehensive understanding.
Abstract
Dense video understanding requires answering several questions such as who is doing what to whom, with what, how, why, and where. Recently, Video Situation Recognition (VidSitu) is framed as a task for structured prediction of multiple events, their relationships, and actions and various verb-role pairs attached to descriptive entities. This task poses several challenges in identifying, disambiguating, and co-referencing entities across multiple verb-role pairs, but also faces some challenges of evaluation. In this work, we propose the addition of spatio-temporal grounding as an essential component of the structured prediction task in a weakly supervised setting, and present a novel three stage Transformer model, VideoWhisperer, that is empowered to make joint predictions. In stage one, we learn contextualised embeddings for video features in parallel with key objects that appear in the…
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Natural Language Processing Techniques
Methods: Attention Is All You Need · Linear Layer · Adam · Dense Connections · Softmax · Position-Wise Feed-Forward Layer · Multi-Head Attention · Label Smoothing · Absolute Position Encodings · Layer Normalization
