(2.5+1)D Spatio-Temporal Scene Graphs for Video Question Answering
Anoop Cherian, Chiori Hori, Tim K. Marks, Jonathan Le Roux

TL;DR
This paper introduces a (2.5+1)D spatio-temporal scene graph representation for video question answering, leveraging inferred 3D structure and hierarchical reasoning to improve QA performance and to speed up training and inference.
Contribution
The paper proposes a novel (2.5+1)D scene graph model that integrates static and dynamic object information with a transformer-based reasoning pipeline for video QA.
Findings
Faster training and inference with the (2.5+1)D representation
Superior performance on NExT-QA and AVSD-QA datasets
Effective hierarchical spatio-temporal reasoning
Abstract
Spatio-temporal scene-graph approaches to video-based reasoning tasks, such as video question-answering (QA), typically construct such graphs for every video frame. These approaches often ignore the fact that videos are essentially sequences of 2D "views" of events happening in a 3D space, and that the semantics of the 3D scene can thus be carried over from frame to frame. Leveraging this insight, we propose a (2.5+1)D scene graph representation to better capture the spatio-temporal information flows inside the videos. Specifically, we first create a 2.5D (pseudo-3D) scene graph by transforming every 2D frame to have an inferred 3D structure using an off-the-shelf 2D-to-3D transformation module, following which we register the video frames into a shared (2.5+1)D spatio-temporal space and ground each 2D scene graph within it. Such a (2.5+1)D graph is then segregated into a static…
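The 2D-to-2.5D step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a pinhole camera with hypothetical intrinsics (`fx`, `fy`, `cx`, `cy`), a per-pixel depth map of the kind an off-the-shelf monocular depth estimator might produce, and a hypothetical helper `lift_box_to_3d` that back-projects the centre of a detected object's bounding box into a pseudo-3D point.

```python
from statistics import median


def lift_box_to_3d(box, depth, fx, fy, cx, cy):
    """Back-project the centre of a 2D box into a pseudo-3D (2.5D) point.

    box   -- (x1, y1, x2, y2) in pixel coordinates (hypothetical detector output)
    depth -- 2D list of per-pixel depth values, indexed as depth[row][col]
    fx, fy, cx, cy -- assumed pinhole intrinsics (focal lengths, principal point)
    """
    x1, y1, x2, y2 = box
    # Centre of the box in pixel coordinates.
    u, v = 0.5 * (x1 + x2), 0.5 * (y1 + y2)
    # Median depth inside the box; less sensitive to outliers than one pixel.
    values = [
        depth[r][c]
        for r in range(int(y1), int(y2))
        for c in range(int(x1), int(x2))
    ]
    z = median(values)
    # Standard pinhole back-projection from pixel coordinates to camera space.
    return ((u - cx) * z / fx, (v - cy) * z / fy, z)


# Toy usage: a flat 4x4 depth map at depth 2.0 and one box in the top-right corner.
flat_depth = [[2.0] * 4 for _ in range(4)]
point = lift_box_to_3d((2, 0, 4, 2), flat_depth, fx=1.0, fy=1.0, cx=2.0, cy=2.0)
print(point)  # (2.0, -2.0, 2.0)
```

Each scene-graph node would then carry such a 3D position in addition to its 2D box. Registering the frames into one shared space requires camera motion estimates as well, which this sketch does not cover.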
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Advanced Image and Video Retrieval Techniques
