Self-Supervised Equivariant Scene Synthesis from Video
Cinjon Resnick, Or Litany, Cosmas Heiß, Hugo Larochelle, Joan Bruna, Kyunghyun Cho

TL;DR
This paper introduces a self-supervised method for scene understanding from video that automatically separates background, characters, and animations, enabling real-time manipulation and synthesis of unseen scene combinations.
Contribution
To the authors' knowledge, it is the first method to perform unsupervised extraction and synthesis of interpretable scene components (background, characters, and animations) from video data.
Findings
Demonstrated on three datasets: Moving MNIST with backgrounds, 2D video game sprites, and Fashion Modeling.
Enables real-time manipulation of scene components.
Achieves unsupervised, interpretable scene decomposition.
Abstract
We propose a self-supervised framework to learn scene representations from video that are automatically delineated into background, characters, and their animations. Our method capitalizes on moving characters being equivariant with respect to their transformation across frames and the background being constant with respect to that same transformation. After training, we can manipulate image encodings in real time to create unseen combinations of the delineated components. As far as we know, ours is the first method to perform unsupervised extraction and synthesis of interpretable background, character, and animation. We demonstrate results on three datasets: Moving MNIST with backgrounds, 2D video game sprites, and Fashion Modeling.
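The two properties the abstract relies on can be illustrated with a toy NumPy sketch. This is not the paper's architecture or training loss. It only checks what "equivariant" and "constant" (invariant) mean for a given transformation. All names here (`shift`, `blur`, `total`, `equivariance_gap`, `invariance_gap`) are hypothetical, and the transformation is assumed to be a circular shift of a 1D frame.

```python
import numpy as np

def shift(x, s=2):
    """Assumed frame-to-frame transformation T: circular shift by s pixels."""
    return np.roll(x, s)

def blur(x):
    """Toy 'character-like' feature map: 3-tap circular average.
    It commutes with circular shifts, so it is shift-equivariant."""
    return (np.roll(x, -1) + x + np.roll(x, 1)) / 3.0

def total(x):
    """Toy 'background-like' feature: global sum, unchanged by a circular shift."""
    return x.sum()

def equivariance_gap(f, T, x):
    """Largest absolute difference between f(T(x)) and T(f(x)); zero means equivariant."""
    return float(np.max(np.abs(f(T(x)) - T(f(x)))))

def invariance_gap(g, T, x):
    """Largest absolute difference between g(T(x)) and g(x); zero means invariant."""
    return float(np.max(np.abs(g(T(x)) - g(x))))

rng = np.random.default_rng(0)
frame = rng.random(16)  # a random 1D "frame" for illustration only

eq_gap = equivariance_gap(blur, shift, frame)    # blur is equivariant to shift
inv_gap = invariance_gap(total, shift, frame)    # total is invariant to shift
not_inv_gap = invariance_gap(blur, shift, frame) # blur is not invariant to shift

print(eq_gap, inv_gap, not_inv_gap)
```

In a learned model, gaps of this kind could serve as penalty terms that push one part of the encoding toward equivariance and another toward invariance. Whether the paper's objective takes this form is not stated in the abstract.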
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Advanced Vision and Imaging · Computer Graphics and Visualization Techniques
