Towards Spatio-Temporal World Scene Graph Generation from Monocular Videos
Rohith Peddi, Saurabh, Shravan Shanmugam, Likhitha Pallapothula, Yu Xiang, Parag Singla, Vibhav Gogate

TL;DR
This paper introduces a new dataset and methods for generating spatio-temporal world scene graphs from monocular videos, so that object interactions can be modeled over time even when the objects involved are occluded or out of view.
Contribution
The paper formalizes the task of World Scene Graph Generation (WSGG) and proposes three methods that draw on 3D reconstruction and temporal reasoning, along with baseline evaluations using vision-language models.
Findings
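The ActionGenome4D dataset upgrades Action Genome videos into 4D scenes, with world-frame oriented bounding boxes and dense relationship annotations that also cover temporarily unobserved objects.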
The proposed methods outperform existing approaches in reasoning about unobserved objects.
Baseline models establish new standards for unlocalized relationship prediction in videos.
Abstract
Spatio-temporal scene graphs provide a principled representation for modeling evolving object interactions, yet existing methods remain fundamentally frame-centric: they reason only about currently visible objects, discard entities upon occlusion, and operate in 2D. To address this, we first introduce ActionGenome4D, a dataset that upgrades Action Genome videos into 4D scenes via feed-forward 3D reconstruction, world-frame oriented bounding boxes for every object involved in actions, and dense relationship annotations including for objects that are temporarily unobserved due to occlusion or camera motion. Building on this data, we formalize World Scene Graph Generation (WSGG), the task of constructing a world scene graph at each timestamp that encompasses all interacting objects in the scene, both observed and unobserved. We then propose three complementary methods, each exploring a…
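To make the task definition above concrete, here is a minimal sketch of what a per-timestamp world scene graph could look like as a data structure. It is not the paper's method or data format: the names `WorldObject`, `WorldSceneGraph`, and `carry_forward`, the box parameterization (center, size, yaw), and the naive rule of keeping the last known box for objects that are no longer visible are all assumptions made for illustration.

```python
from dataclasses import dataclass, field, replace
from typing import Dict, List, Tuple


@dataclass
class WorldObject:
    """One object with a world-frame oriented box (hypothetical layout)."""
    label: str
    center: Tuple[float, float, float]  # box center in world coordinates
    size: Tuple[float, float, float]    # box extents along its own axes
    yaw: float                          # rotation about the vertical axis
    observed: bool                      # False if occluded or out of view


@dataclass
class WorldSceneGraph:
    """Objects and (subject, predicate, object) triples at one timestamp."""
    timestamp: int
    objects: Dict[str, WorldObject] = field(default_factory=dict)
    relations: List[Tuple[str, str, str]] = field(default_factory=list)


def carry_forward(prev, visible, relations, timestamp):
    """Build the graph at `timestamp` from the previous graph.

    Objects missing from `visible` are kept with their last known box and
    marked unobserved (an illustrative assumption, not the paper's model).
    """
    objects = {
        oid: replace(obj, observed=False)
        for oid, obj in prev.objects.items()
        if oid not in visible
    }
    objects.update(visible)
    # Keep only relations whose endpoints are both present in the graph.
    kept = [(s, p, o) for s, p, o in relations if s in objects and o in objects]
    return WorldSceneGraph(timestamp, objects, kept)


person = WorldObject("person", (0.0, 0.0, 0.9), (0.5, 0.5, 1.8), 0.0, True)
cup = WorldObject("cup", (0.4, 0.1, 0.8), (0.1, 0.1, 0.1), 0.0, True)
g0 = WorldSceneGraph(0, {"person": person, "cup": cup},
                     [("person", "holding", "cup")])

# At t=1 the cup is not visible, but it stays in the world graph.
g1 = carry_forward(g0, {"person": person}, [("person", "holding", "cup")], 1)
```

A frame-centric graph would drop `cup` at the second timestamp; in this sketch it stays in `g1.objects` with `observed=False`, so the `holding` relation can still be expressed.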
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition
