SSGVS: Semantic Scene Graph-to-Video Synthesis
Yuren Cong, Jinhui Yi, Bodo Rosenhahn, Michael Ying Yang

TL;DR
This paper introduces SSGVS, a novel framework that uses semantic scene graphs to guide video synthesis, enabling explicit temporal control and improved generation quality.
Contribution
The paper proposes a video synthesis method conditioned on semantic video scene graphs, together with a video scene graph (VSG) encoder that encodes the annotated graphs and predicts graph representations for unlabeled frames.
Findings
Outperforms existing models on the Action Genome dataset
Effectively encodes and predicts scene graph representations
Demonstrates the importance of scene graphs in video synthesis
Abstract
As a natural extension of the image synthesis task, video synthesis has attracted a lot of interest recently. Many image synthesis works utilize class labels or text as guidance. However, neither labels nor text can provide explicit temporal guidance, such as when an action starts or ends. To overcome this limitation, we introduce semantic video scene graphs as input for video synthesis, as they represent the spatial and temporal relationships between objects in the scene. Since video scene graphs are usually temporally discrete annotations, we propose a video scene graph (VSG) encoder that not only encodes the existing video scene graphs but also predicts the graph representations for unlabeled frames. The VSG encoder is pre-trained with different contrastive multi-modal losses. A semantic scene graph-to-video synthesis framework (SSGVS), based on the pre-trained VSG encoder, VQ-VAE,…
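To make the idea of temporally discrete scene graph annotations concrete, the following is a minimal sketch in Python. It is not the authors' code or data format: the `Triplet` class, the helper functions, the frame indices, and the example relations are all hypothetical and chosen only for illustration. It shows how sparse per-frame graphs can indicate roughly when a relation starts and ends, and which frames are left without a graph (the frames for which the VSG encoder is described as predicting representations).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triplet:
    """One scene graph edge: subject -(predicate)-> object (hypothetical format)."""
    subject: str
    predicate: str
    obj: str


def relation_span(annotations, triplet):
    """Return (first, last) annotated frame index containing the triplet, or None."""
    frames = sorted(f for f, graph in annotations.items() if triplet in graph)
    if not frames:
        return None
    return frames[0], frames[-1]


def unlabeled_frames(annotations, num_frames):
    """List frame indices in [0, num_frames) that have no scene graph annotation."""
    return [f for f in range(num_frames) if f not in annotations]


# Illustrative sparse annotations: only every 8th frame carries a scene graph.
annotations = {
    0: {Triplet("person", "looking_at", "cup")},
    8: {Triplet("person", "holding", "cup")},
    16: {Triplet("person", "holding", "cup"),
         Triplet("person", "drinking_from", "cup")},
    24: {Triplet("person", "looking_at", "cup")},
}

holding = Triplet("person", "holding", "cup")
print(relation_span(annotations, holding))        # span over annotated frames only
print(len(unlabeled_frames(annotations, 25)))     # frames with no graph
```

Because the annotations are sparse, the span is only known at the granularity of the labeled frames, which is one way to see why a representation for the frames in between is needed.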
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Advanced Vision and Imaging · 3D Shape Modeling and Analysis
Methods: Multi-Head Attention · Attention Is All You Need · Position-Wise Feed-Forward Layer · Label Smoothing · Layer Normalization · Linear Layer · Softmax · Adam · Absolute Position Encodings · Byte Pair Encoding
