(2.5+1)D Spatio-Temporal Scene Graphs for Video Question Answering

Anoop Cherian; Chiori Hori; Tim K. Marks; Jonathan Le Roux

arXiv:2202.09277 · cs.CV · March 29, 2022 · 1 citation

TL;DR

This paper introduces a (2.5+1)D spatio-temporal scene graph representation for video question answering, leveraging inferred 3D structures and hierarchical reasoning to improve performance and efficiency.

Contribution

The paper proposes a novel (2.5+1)D scene graph model that integrates static and dynamic object information with a transformer-based reasoning pipeline for video QA.

Findings

01 Faster training and inference with the (2.5+1)D representation

02 Superior performance on the NExT-QA and AVSD-QA datasets

03 Effective hierarchical spatio-temporal reasoning

Abstract

Spatio-temporal scene-graph approaches to video-based reasoning tasks, such as video question-answering (QA), typically construct such graphs for every video frame. These approaches often ignore the fact that videos are essentially sequences of 2D "views" of events happening in a 3D space, and that the semantics of the 3D scene can thus be carried over from frame to frame. Leveraging this insight, we propose a (2.5+1)D scene graph representation to better capture the spatio-temporal information flows inside the videos. Specifically, we first create a 2.5D (pseudo-3D) scene graph by transforming every 2D frame to have an inferred 3D structure using an off-the-shelf 2D-to-3D transformation module, following which we register the video frames into a shared (2.5+1)D spatio-temporal space and ground each 2D scene graph within it. Such a (2.5+1)D graph is then segregated into a static…
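The lifting step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a pinhole camera with known intrinsics, a per-detection depth value (standing in for the off-the-shelf 2D-to-3D module), and frames that are already registered into one coordinate system. The names `Node`, `back_project`, `lift_frame`, and `merge_nodes`, as well as the merge radius, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from math import dist


@dataclass
class Node:
    label: str    # object class from a 2D detector
    frame: int    # index of the frame the detection came from
    xyz: tuple    # 3D centroid in the shared space (assumed pre-registered)


def back_project(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with a depth value to a 3D point (pinhole model)."""
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


def lift_frame(detections, frame, intrinsics):
    """Turn (label, u, v, depth) box centres of one frame into 3D graph nodes."""
    fx, fy, cx, cy = intrinsics
    return [
        Node(label, frame, back_project(u, v, d, fx, fy, cx, cy))
        for label, u, v, d in detections
    ]


def merge_nodes(nodes, radius=0.5):
    """Greedy merge (an assumption, not the paper's rule): a node is dropped
    if an earlier node has the same label and lies within `radius`."""
    merged = []
    for n in nodes:
        for m in merged:
            if m.label == n.label and dist(m.xyz, n.xyz) <= radius:
                break
        else:
            merged.append(n)
    return merged


# Hypothetical intrinsics and detections for two frames.
intrinsics = (500.0, 500.0, 320.0, 240.0)
frame0 = [("sofa", 320, 240, 2.0), ("person", 100, 240, 2.0)]
frame1 = [("sofa", 330, 240, 2.0), ("person", 500, 240, 2.0)]

nodes = lift_frame(frame0, 0, intrinsics) + lift_frame(frame1, 1, intrinsics)
graph_nodes = merge_nodes(nodes)
```

In this toy input the sofa barely moves between frames, so its two detections collapse into one node, while the person moves far enough to stay as two nodes. This is one way to see why carrying 3D structure across frames can shrink the graph compared with building an independent graph per frame.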

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Advanced Image and Video Retrieval Techniques