Generative Video Transformer: Can Objects be the Words?

Yi-Fu Wu; Jaesik Yoon; Sungjin Ahn

arXiv:2107.09240 · cs.LG · July 21, 2021 · 5 cites

TL;DR

This paper introduces the Object-Centric Video Transformer (OCVT), a memory-efficient, unsupervised model that decomposes scenes into objects for improved long-term video generation and scene understanding.

Contribution

The paper presents an object-centric approach to video transformers that enables unsupervised learning of the spatio-temporal dynamics of multiple interacting objects and memory-efficient training on videos of up to 70 frames.

Findings

01 OCVT outperforms RNN-based models and baseline video transformers in future frame generation.

02 OCVT achieves state-of-the-art performance on the CATER scene reasoning task.

03 The model trains on videos of up to 70 frames on a single 48GB GPU.

Abstract

Transformers have been successful for many natural language processing tasks. However, applying transformers to the video domain for tasks such as long-term video generation and scene understanding has remained elusive due to the high computational complexity and the lack of natural tokenization. In this paper, we propose the Object-Centric Video Transformer (OCVT) which utilizes an object-centric approach for decomposing scenes into tokens suitable for use in a generative video transformer. By factoring the video into objects, our fully unsupervised model is able to learn complex spatio-temporal dynamics of multiple interacting objects in a scene and generate future frames of the video. Our model is also significantly more memory-efficient than pixel-based models and thus able to train on videos of length up to 70 frames with a single 48GB GPU. We compare our model with previous…
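
The memory-efficiency claim in the abstract can be illustrated with a rough back-of-the-envelope sketch. It counts the entries of a single full self-attention matrix over a 70-frame video when each frame is tokenized into image patches versus into object tokens. The patch and object counts, the function name `attention_entries`, and the assumption of full attention over the whole sequence are illustrative choices, not values or design details taken from the paper.

```python
def attention_entries(tokens_per_frame: int, num_frames: int) -> int:
    """Entries in one full self-attention matrix over the whole video.

    Assumes every token attends to every other token, so the cost grows
    with the square of the total sequence length.
    """
    seq_len = tokens_per_frame * num_frames
    return seq_len * seq_len


# Hypothetical settings, chosen only for illustration (not from the paper),
# except the video length, which is the 70 frames mentioned in the abstract.
NUM_FRAMES = 70
PATCHES_PER_FRAME = 256  # assumed: a 16 x 16 grid of patch tokens per frame
OBJECTS_PER_FRAME = 10   # assumed: up to 10 object tokens per frame

patch_cost = attention_entries(PATCHES_PER_FRAME, NUM_FRAMES)
object_cost = attention_entries(OBJECTS_PER_FRAME, NUM_FRAMES)

print(f"patch tokens : {patch_cost:,} attention entries")
print(f"object tokens: {object_cost:,} attention entries")
print(f"ratio        : {patch_cost / object_cost:.1f}x")
```

Under these assumed numbers, the object-token sequence is about 25 times shorter than the patch-token sequence, so the attention matrix is several hundred times smaller. The actual savings depend on the tokenization and attention pattern used by the model.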

Videos

Generative Video Transformer: Can Objects be the Words? · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition

Methods: Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Label Smoothing · Residual Connection · Layer Normalization · Softmax · Dense Connections · Adam