Towards End-to-End Generative Modeling of Long Videos with Memory-Efficient Bidirectional Transformers

Jaehoon Yoo; Semin Kim; Doyup Lee; Chiheon Kim; Seunghoon Hong

arXiv:2303.11251 · cs.CV · June 1, 2023 · 1 citation

TL;DR

This paper introduces the Memory-efficient Bidirectional Transformer (MeBT), a model for end-to-end long video generation that decodes in parallel, captures long-term dependencies with linear complexity, and is reported to surpass autoregressive models in both quality and speed.

Contribution

The paper presents a novel bidirectional transformer architecture that achieves linear complexity for long video modeling, allowing fast, parallel decoding of entire videos from partial observations.

Findings

01. Significant improvement in video quality over autoregressive models
02. Faster inference times due to linear complexity
03. Effective modeling of long-term dependencies in videos

Abstract

Autoregressive transformers have shown remarkable success in video generation. However, these transformers cannot directly learn long-term dependencies in videos because of the quadratic complexity of self-attention, and they inherently suffer from slow inference and error propagation caused by the autoregressive process. In this paper, we propose the Memory-efficient Bidirectional Transformer (MeBT) for end-to-end learning of long-term dependencies in videos and fast inference. Based on recent advances in bidirectional transformers, our method learns to decode the entire spatio-temporal volume of a video in parallel from partially observed patches. The proposed transformer achieves linear time complexity in both encoding and decoding by projecting observable context tokens into a fixed number of latent tokens and conditioning them to decode the masked tokens through the…
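
The complexity argument in the abstract can be illustrated with a rough count of query-key pairs. The sketch below is not taken from the paper or the official repository. It assumes a layout in which a fixed set of latent tokens attends to the context tokens, the latents attend to each other, and the masked tokens then attend to the latents. The abstract is cut off before it names the exact mechanism, so this layout, the function names, and all sizes (tokens per frame, latent count, observed fraction) are hypothetical and chosen only for illustration.

```python
def full_attention_pairs(num_tokens: int) -> int:
    """Query-key pairs scored by full self-attention over all tokens."""
    return num_tokens * num_tokens


def latent_bottleneck_pairs(num_context: int, num_masked: int, num_latents: int) -> int:
    """Query-key pairs when attention is routed through a fixed set of latents.

    Assumed layout (hypothetical, for illustration only):
      encode: latents attend to context tokens  -> num_latents * num_context
      mix:    latents attend to each other      -> num_latents * num_latents
      decode: masked tokens attend to latents   -> num_masked * num_latents
    """
    encode = num_latents * num_context
    mix = num_latents * num_latents
    decode = num_masked * num_latents
    return encode + mix + decode


if __name__ == "__main__":
    tokens_per_frame = 256  # assumed, e.g. a 16 x 16 token grid per frame
    num_latents = 256       # assumed fixed latent count, independent of length
    for frames in (16, 32, 64, 128):
        total = frames * tokens_per_frame
        context = total // 4       # assumed: 25% of tokens are observed
        masked = total - context   # the rest are decoded in parallel
        full = full_attention_pairs(total)
        latent = latent_bottleneck_pairs(context, masked, num_latents)
        print(f"{frames:4d} frames: full={full:>13,d}  latent={latent:>11,d}")
```

Under these assumptions, doubling the number of frames quadruples the full self-attention count, while the latent-routed count grows by a fixed amount per added token because the latent count does not depend on video length.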

Code & Models

Repositories

Ugness/MeBT (PyTorch, official)

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Image Enhancement Techniques · Advanced Image Processing Techniques

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dense Connections · Position-Wise Feed-Forward Layer · Adam · Softmax · Label Smoothing · Byte Pair Encoding · Residual Connection