Synthetic Visual Genome 2: Extracting Large-scale Spatio-Temporal Scene Graphs from Videos
Ziqi Gao, Jieyu Zhang, Wisdom Oluchi Ikezogwo, Jae Sung Park, Tario G. You, Daniel Ogbu, Chenhao Zheng, Weikai Huang, Yinuo Yang, Winson Han, Quan Kong, Rajat Saini, Ranjay Krishna

TL;DR
This paper introduces SVG2, a large-scale synthetic video scene graph dataset, and a novel model TRaSER that efficiently generates spatio-temporal scene graphs, significantly improving relation detection and aiding video question answering.
Contribution
The paper presents SVG2, a comprehensive synthetic dataset for spatio-temporal scene graphs, and TRaSER, a new model that efficiently produces these graphs from videos, enhancing downstream tasks.
Findings
TRaSER improves relation detection by +15 to 20%.
TRaSER enhances object prediction by +30 to 40%.
Scene graphs improve video question answering accuracy.
Abstract
We introduce Synthetic Visual Genome 2 (SVG2), a large-scale panoptic video scene graph dataset. SVG2 contains over 636K videos with 6.6M objects, 52.0M attributes, and 6.7M relations, providing an order-of-magnitude increase in scale and diversity over prior spatio-temporal scene graph datasets. To create SVG2, we design a fully automated pipeline that combines multi-scale panoptic segmentation, online-offline trajectory tracking with automatic new-object discovery, per-trajectory semantic parsing, and GPT-5-based spatio-temporal relation inference. Human verification of SVG2 annotation accuracy confirms its reliability (objects: 93.8%, attributes: 88.3%, relations: 85.4%). Building on this resource, we train TRaSER, a video scene graph generation model. TRaSER augments VLMs with a trajectory-aligned token arrangement mechanism and new modules: an object-trajectory resampler and a…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Advanced Graph Neural Networks · Human Pose and Action Recognition
