SSGVS: Semantic Scene Graph-to-Video Synthesis

Yuren Cong; Jinhui Yi; Bodo Rosenhahn; Michael Ying Yang

arXiv:2211.06119 · cs.CV · November 18, 2022

TL;DR

This paper introduces SSGVS, a novel framework that uses semantic scene graphs to guide video synthesis, enabling explicit temporal control and improved generation quality.

Contribution

The paper proposes a video synthesis method conditioned on semantic video scene graphs. A dedicated video scene graph (VSG) encoder encodes the annotated graphs and predicts graph representations for unlabeled frames, which supplies explicit temporal guidance.

Findings

1. Outperforms existing models on the Action Genome dataset.

2. The VSG encoder encodes existing scene graphs and predicts graph representations for unlabeled frames.

3. Demonstrates the importance of scene graphs in video synthesis.

Abstract

As a natural extension of the image synthesis task, video synthesis has attracted a lot of interest recently. Many image synthesis works utilize class labels or text as guidance. However, neither labels nor text can provide explicit temporal guidance, such as when an action starts or ends. To overcome this limitation, we introduce semantic video scene graphs as input for video synthesis, as they represent the spatial and temporal relationships between objects in the scene. Since video scene graphs are usually temporally discrete annotations, we propose a video scene graph (VSG) encoder that not only encodes the existing video scene graphs but also predicts the graph representations for unlabeled frames. The VSG encoder is pre-trained with different contrastive multi-modal losses. A semantic scene graph-to-video synthesis framework (SSGVS), based on the pre-trained VSG encoder, VQ-VAE,…
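The abstract says the VSG encoder is pre-trained with contrastive multi-modal losses but, in this excerpt, does not spell them out. As a rough illustration only, the sketch below shows a generic symmetric InfoNCE-style contrastive loss between per-frame graph embeddings and frame embeddings. It is not the paper's formulation: the function names, the `temperature` value, and the pairing of graph embedding `i` with frame embedding `i` are all assumptions made for this example.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(graph_embs, frame_embs, temperature=0.07):
    """Generic symmetric InfoNCE loss (hypothetical stand-in, not the paper's loss).

    graph_embs[i] and frame_embs[i] are assumed to describe the same frame,
    so matching pairs sit on the diagonal of the similarity matrix.
    """
    g = [l2_normalize(v) for v in graph_embs]
    f = [l2_normalize(v) for v in frame_embs]
    # Cosine similarity between every graph/frame pair, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(gi, fj)) / temperature for fj in f]
        for gi in g
    ]
    logits_t = [list(col) for col in zip(*logits)]  # frame-to-graph direction
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))


# Toy check: correctly paired embeddings should score a lower loss
# than the same embeddings with the pairing swapped.
aligned = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                     [[1.0, 0.0], [0.0, 1.0]])
swapped = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                     [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

The loss is averaged over both directions (graph-to-frame and frame-to-graph) so that neither modality is treated as the sole query. This is a common choice in multi-modal contrastive pre-training, though the excerpt does not confirm the authors made it.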

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Advanced Vision and Imaging · 3D Shape Modeling and Analysis

Methods: Multi-Head Attention · Attention Is All You Need · Position-Wise Feed-Forward Layer · Label Smoothing · Layer Normalization · Linear Layer · Softmax · Adam · Absolute Position Encodings · Byte Pair Encoding