Patch-based Object-centric Transformers for Efficient Video Generation

Wilson Yan; Ryo Okumura; Stephen James; Pieter Abbeel

arXiv:2206.04003 · cs.CV · June 22, 2022 · 1 cites

TL;DR

This paper introduces POVT, a region-based video transformer that models object-centric information for efficient and controllable video generation, outperforming or matching existing models while being more scalable.

Contribution

The paper proposes a novel object-centric transformer architecture that leverages bounding boxes for efficient, scalable, and controllable video generation.

Findings

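01 Matches or outperforms other video generation models on difficult object-centric datasets.

02 Improves training efficiency by restricting longer-horizon temporal context to object information, which compresses better than full frames.

03 Enables object-centric control over generated video by manipulating bounding boxes.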

Abstract

In this work, we present Patch-based Object-centric Video Transformer (POVT), a novel region-based video generation architecture that leverages object-centric information to efficiently model temporal dynamics in videos. We build upon prior work in video prediction via an autoregressive transformer over the discrete latent space of compressed videos, with an added modification to model object-centric information via bounding boxes. Due to better compressibility of object-centric representations, we can improve training efficiency by allowing the model to only access object information for longer horizon temporal information. When evaluated on various difficult object-centric datasets, our method achieves better or equal performance to other video generation models, while remaining computationally more efficient and scalable. In addition, we show that our method is able to perform…
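The efficiency argument in the abstract can be illustrated with a small token-counting sketch. It is not the POVT implementation: the function names (`box_to_cells`, `context_size`), the pixel-space box format `(x0, y0, x1, y1)`, and the rule "keep the full latent grid for the most recent frames, keep only cells overlapped by a bounding box for older frames" are all assumptions made here for illustration. The sketch only shows why an object-only history yields a much shorter transformer context than full frames.

```python
from math import ceil, floor


def box_to_cells(box, frame_hw, latent_hw):
    """Return the set of (row, col) latent cells overlapped by a pixel-space box.

    Hypothetical helper: `box` is (x0, y0, x1, y1) in pixels, and the latent
    grid is assumed to tile the frame uniformly.
    """
    x0, y0, x1, y1 = box
    frame_h, frame_w = frame_hw
    lat_h, lat_w = latent_hw
    # Pixels covered by one latent cell along each axis.
    cell_h = frame_h / lat_h
    cell_w = frame_w / lat_w
    r0 = max(0, floor(y0 / cell_h))
    r1 = min(lat_h, ceil(y1 / cell_h))
    c0 = max(0, floor(x0 / cell_w))
    c1 = min(lat_w, ceil(x1 / cell_w))
    return {(r, c) for r in range(r0, r1) for c in range(c0, c1)}


def context_size(boxes_per_frame, frame_hw, latent_hw, full_frames=1):
    """Count context tokens when only the last `full_frames` frames are kept whole.

    Older frames contribute only the latent cells covered by at least one box
    (an assumed rule, chosen to mirror the idea in the abstract).
    """
    lat_h, lat_w = latent_hw
    num_frames = len(boxes_per_frame)
    total = 0
    for t, boxes in enumerate(boxes_per_frame):
        if t >= num_frames - full_frames:
            total += lat_h * lat_w  # recent frame: every latent token
        else:
            cells = set()
            for box in boxes:
                cells |= box_to_cells(box, frame_hw, latent_hw)
            total += len(cells)  # older frame: object tokens only
    return total


# Four 64x64 frames, a 16x16 latent grid, one 16x16-pixel object per frame.
history = [[(8, 8, 24, 24)] for _ in range(4)]
object_only = context_size(history, (64, 64), (16, 16), full_frames=1)
all_frames = context_size(history, (64, 64), (16, 16), full_frames=4)
print(object_only, all_frames)
```

In this toy setting the object-only history uses 304 tokens against 1024 for four full frames. The gap grows with the horizon length, which matters because self-attention cost scales quadratically with context length.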

Code & Models

Repositories

wilson1yan/povt (PyTorch, official)

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Advanced Vision and Imaging · Advanced Image Processing Techniques

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Layer Normalization · Adam · Label Smoothing · Softmax · Byte Pair Encoding · Dropout · Residual Connection