Unified Graph Structured Models for Video Understanding
Anurag Arnab, Chen Sun, Cordelia Schmid

TL;DR
This paper introduces a unified graph neural network model for video understanding that explicitly captures spatio-temporal relationships, improving performance on action detection and scene graph classification tasks.
Contribution
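It presents a generalized message passing graph neural network that models spatio-temporal relationships in videos. The model uses explicit object representations when supervision is available and implicit representations otherwise, and its general formulation makes it possible to study how choices of graph structure and representation affect performance.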
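As a rough illustration of what message passing over a spatio-temporal graph involves, the Python sketch below runs one round of mean-aggregation message passing. It is a generic step, not the paper's formulation: the node names, feature vectors, edge list, and the function `message_passing_step` are all hypothetical, and the update rule (add the mean of incoming neighbour features to a node's own features) is an assumption chosen for simplicity.

```python
# Generic mean-aggregation message passing (illustrative only;
# NOT the update rule used in the paper, which this summary does not give).
from typing import Dict, List, Tuple

Vector = List[float]


def message_passing_step(
    features: Dict[str, Vector],
    edges: List[Tuple[str, str]],
) -> Dict[str, Vector]:
    """Run one round: each node adds the mean of its incoming messages."""
    # Collect the features sent along each directed edge (src -> dst).
    incoming: Dict[str, List[Vector]] = {node: [] for node in features}
    for src, dst in edges:
        incoming[dst].append(features[src])

    updated: Dict[str, Vector] = {}
    for node, own in features.items():
        messages = incoming[node]
        if not messages:
            # Nodes with no incoming edges keep their features unchanged.
            updated[node] = list(own)
            continue
        # Element-wise mean over all incoming messages.
        mean = [sum(values) / len(messages) for values in zip(*messages)]
        updated[node] = [o + m for o, m in zip(own, mean)]
    return updated


# Hypothetical graph: one actor seen at two time steps, plus one object.
features = {
    "actor_t0": [1.0, 0.0],
    "actor_t1": [0.0, 1.0],
    "object_t1": [2.0, 2.0],
}
edges = [
    ("actor_t0", "actor_t1"),   # temporal edge: same actor, earlier frame
    ("object_t1", "actor_t1"),  # spatial edge: object in the same frame
]

updated = message_passing_step(features, edges)
```

Here `actor_t1` receives one temporal and one spatial message, so its updated features mix information from both its own past and a nearby object. A learned model would replace the fixed mean-and-add rule with trainable message and update functions.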
Findings
Achieves state-of-the-art results on AVA, UCF101-24, and Action Genome datasets.
Effectively models relationships between scene entities.
Demonstrates improved relational reasoning in videos.
Abstract
Accurate video understanding involves reasoning about the relationships between actors, objects and their environment, often over long temporal intervals. In this paper, we propose a message passing graph neural network that explicitly models these spatio-temporal relations and can use explicit representations of objects, when supervision is available, and implicit representations otherwise. Our formulation generalises previous structured models for video understanding, and allows us to study how different design choices in graph structure and representation affect the model's performance. We demonstrate our method on two different tasks requiring relational reasoning in videos -- spatio-temporal action detection on AVA and UCF101-24, and video scene graph classification on the recent Action Genome dataset -- and achieve state-of-the-art results on all three datasets. Furthermore, we…
Taxonomy
Methods: Graph Neural Network
