Generative Video Transformer: Can Objects be the Words?
Yi-Fu Wu, Jaesik Yoon, Sungjin Ahn

TL;DR
This paper introduces the Object-Centric Video Transformer (OCVT), a memory-efficient, unsupervised model that decomposes scenes into objects for improved long-term video generation and scene understanding.
Contribution
The paper presents a novel object-centric approach for video transformers, enabling unsupervised learning of complex dynamics and efficient training on longer videos.
Findings
OCVT outperforms RNN-based models and baseline transformers in future frame generation.
OCVT achieves state-of-the-art results on the CATER scene reasoning task.
The model trains on videos of up to 70 frames on a single 48GB GPU.
Abstract
Transformers have been successful for many natural language processing tasks. However, applying transformers to the video domain for tasks such as long-term video generation and scene understanding has remained elusive due to the high computational complexity and the lack of natural tokenization. In this paper, we propose the Object-Centric Video Transformer (OCVT) which utilizes an object-centric approach for decomposing scenes into tokens suitable for use in a generative video transformer. By factoring the video into objects, our fully unsupervised model is able to learn complex spatio-temporal dynamics of multiple interacting objects in a scene and generate future frames of the video. Our model is also significantly more memory-efficient than pixel-based models and thus able to train on videos of length up to 70 frames with a single 48GB GPU. We compare our model with previous…
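To give a rough sense of why object-level tokens could be more memory-efficient than pixel-based tokens, the sketch below counts the entries in a single full self-attention matrix over a 70-frame video under two tokenizations. Only the 70-frame length comes from the abstract. The frame size, patch size, and number of object slots are hypothetical values chosen for illustration, and the function `attention_entries` is not part of OCVT.

```python
def attention_entries(num_frames: int, tokens_per_frame: int) -> int:
    """Entries in one full self-attention matrix spanning every token of the video."""
    seq_len = num_frames * tokens_per_frame
    return seq_len * seq_len


NUM_FRAMES = 70  # video length reported in the abstract

# Hypothetical pixel-based tokenization: a 64x64 frame cut into 8x8 patches.
PATCH_TOKENS_PER_FRAME = (64 // 8) ** 2  # 64 tokens per frame

# Hypothetical object-centric tokenization: one token per object slot.
OBJECT_TOKENS_PER_FRAME = 8

pixel_cost = attention_entries(NUM_FRAMES, PATCH_TOKENS_PER_FRAME)
object_cost = attention_entries(NUM_FRAMES, OBJECT_TOKENS_PER_FRAME)
ratio = pixel_cost // object_cost

print(f"patch tokens:  {pixel_cost:,} attention entries")
print(f"object tokens: {object_cost:,} attention entries")
print(f"reduction:     {ratio}x")  # 64x under these assumed sizes
```

Because self-attention cost grows with the square of the sequence length, cutting the tokens per frame by a factor of 8 shrinks the attention matrix by a factor of 64 in this toy setting. The actual savings depend on the model's real token counts and attention pattern, which this summary does not state.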
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition
Methods: Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Label Smoothing · Residual Connection · Layer Normalization · Softmax · Dense Connections · Adam
