Synthetic Visual Genome 2: Extracting Large-scale Spatio-Temporal Scene Graphs from Videos

Ziqi Gao; Jieyu Zhang; Wisdom Oluchi Ikezogwo; Jae Sung Park; Tario G. You; Daniel Ogbu; Chenhao Zheng; Weikai Huang; Yinuo Yang; Winson Han; Quan Kong; Rajat Saini; Ranjay Krishna

arXiv:2602.23543·cs.CV·March 9, 2026


Open Access · 1 model · 1 dataset

TL;DR

This paper introduces SVG2, a large-scale synthetic video scene graph dataset, and TRaSER, a new model that efficiently generates spatio-temporal scene graphs from videos, improving relation detection by 15 to 20% and aiding video question answering.

Contribution

The paper presents SVG2, a large-scale synthetic dataset of spatio-temporal scene graphs, and TRaSER, a new model that efficiently produces these graphs from videos and improves downstream tasks such as video question answering.

Findings

01. TRaSER improves relation detection by +15 to 20%.
02. TRaSER enhances object prediction by +30 to 40%.
03. Scene graphs improve video question answering accuracy.

Abstract

We introduce Synthetic Visual Genome 2 (SVG2), a large-scale panoptic video scene graph dataset. SVG2 contains over 636K videos with 6.6M objects, 52.0M attributes, and 6.7M relations, providing an order-of-magnitude increase in scale and diversity over prior spatio-temporal scene graph datasets. To create SVG2, we design a fully automated pipeline that combines multi-scale panoptic segmentation, online-offline trajectory tracking with automatic new-object discovery, per-trajectory semantic parsing, and GPT-5-based spatio-temporal relation inference. Human verification of SVG2 annotation accuracy confirms its reliability (objects: 93.8%, attributes: 88.3%, relations: 85.4%). Building on this resource, we train TRaSER, a video scene graph generation model. TRaSER augments VLMs with a trajectory-aligned token arrangement mechanism and new modules: an object-trajectory resampler and a…
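
To make the structure described in the abstract concrete, the following is a minimal sketch of how a spatio-temporal scene graph could be held in memory: object trajectories carrying attributes, plus relations that hold over a frame interval. The class names, field names, and frame-based time representation are all hypothetical and chosen for illustration; they are not the actual SVG2 schema or TRaSER output format.

```python
from dataclasses import dataclass, field


@dataclass
class Trajectory:
    """One tracked object: a label plus the frames where it is visible (hypothetical schema)."""
    object_id: int
    label: str
    frames: list[int]                          # frame indices where the object appears
    attributes: list[str] = field(default_factory=list)


@dataclass
class Relation:
    """A (subject, predicate, object) triple that holds over a frame interval."""
    subject_id: int
    predicate: str
    object_id: int
    start_frame: int
    end_frame: int                             # inclusive


@dataclass
class VideoSceneGraph:
    objects: dict[int, Trajectory] = field(default_factory=dict)
    relations: list[Relation] = field(default_factory=list)

    def add_object(self, traj: Trajectory) -> None:
        self.objects[traj.object_id] = traj

    def add_relation(self, rel: Relation) -> None:
        # Reject relations that point at trajectories not yet in the graph.
        if rel.subject_id not in self.objects or rel.object_id not in self.objects:
            raise KeyError("relation references an unknown object id")
        self.relations.append(rel)

    def triples_at(self, frame: int) -> list[tuple[str, str, str]]:
        """Return (subject label, predicate, object label) triples active at a frame."""
        return [
            (self.objects[r.subject_id].label, r.predicate, self.objects[r.object_id].label)
            for r in self.relations
            if r.start_frame <= frame <= r.end_frame
        ]


# Toy example: a person holds a cup during frames 10 through 20.
graph = VideoSceneGraph()
graph.add_object(Trajectory(0, "person", frames=list(range(0, 30)), attributes=["standing"]))
graph.add_object(Trajectory(1, "cup", frames=list(range(5, 30)), attributes=["white"]))
graph.add_relation(Relation(0, "holding", 1, start_frame=10, end_frame=20))

print(graph.triples_at(15))  # [('person', 'holding', 'cup')]
print(graph.triples_at(25))  # []
```

Storing each relation with an explicit interval, rather than repeating it per frame, keeps the graph compact and makes time-sliced queries such as `triples_at` a simple filter.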


Code & Models

Models

UWGZQ/TRASER
model · 35 dl · ♡ 4

Datasets

UWGZQ/Synthetic_Visual_Genome2
dataset · 1.5k dl


Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Graph Neural Networks · Human Pose and Action Recognition