Towards End-to-End Generative Modeling of Long Videos with Memory-Efficient Bidirectional Transformers
Jaehoon Yoo, Semin Kim, Doyup Lee, Chiheon Kim, Seunghoon Hong

TL;DR
This paper introduces Memory-efficient Bidirectional Transformer (MeBT), enabling efficient, parallel, end-to-end long video generation by capturing long-term dependencies with linear complexity, surpassing autoregressive models in quality and speed.
Contribution
The paper presents a novel bidirectional transformer architecture that achieves linear complexity for long video modeling, allowing fast, parallel decoding of entire videos from partial observations.
Findings
Significant improvement in video quality over autoregressive models
Faster inference times due to linear complexity
Effective modeling of long-term dependencies in videos
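The linear-complexity finding can be made concrete by counting attention score entries. The sketch below is only a back-of-the-envelope illustration, not the paper's cost model. It assumes a latent bottleneck in which a fixed number of latent tokens first read from the video tokens and the video tokens then read from the latents. The function names, the 256 tokens per frame, and the 256 latents are hypothetical values chosen for illustration.

```python
def full_attention_pairs(num_tokens: int) -> int:
    """Score entries for full self-attention: every token attends to every token."""
    return num_tokens * num_tokens


def latent_attention_pairs(num_tokens: int, num_latents: int) -> int:
    """Score entries for a latent bottleneck (assumed structure).

    Encode: latents attend to tokens  -> num_latents * num_tokens entries.
    Decode: tokens attend to latents  -> num_tokens * num_latents entries.
    """
    return 2 * num_tokens * num_latents


TOKENS_PER_FRAME = 256  # hypothetical, e.g. a 16 x 16 token grid per frame
NUM_LATENTS = 256       # hypothetical fixed latent count

for frames in (16, 32, 64, 128):
    n = frames * TOKENS_PER_FRAME
    print(frames, full_attention_pairs(n), latent_attention_pairs(n, NUM_LATENTS))
```

Doubling the number of frames quadruples the first count but only doubles the second, which is the sense in which a fixed latent count gives cost linear in video length.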
Abstract
Autoregressive transformers have shown remarkable success in video generation. However, these transformers cannot directly learn the long-term dependency in videos because of the quadratic complexity of self-attention, and they inherently suffer from slow inference and error propagation because of the autoregressive process. In this paper, we propose Memory-efficient Bidirectional Transformer (MeBT) for end-to-end learning of long-term dependency in videos and fast inference. Based on recent advances in bidirectional transformers, our method learns to decode the entire spatio-temporal volume of a video in parallel from partially observed patches. The proposed transformer achieves a linear time complexity in both encoding and decoding, by projecting observable context tokens into a fixed number of latent tokens and conditioning them to decode the masked tokens through the…
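The encode-then-decode idea in the abstract can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the authors' implementation. It assumes that both steps use single-head scaled dot-product cross-attention, which the truncated abstract does not confirm. It omits learned projections, feed-forward layers, and positional information, and it uses random arrays in place of real token embeddings. All names (`cross_attention`, `latent_bottleneck_decode`) and sizes are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, keys_values):
    """Single-head scaled dot-product attention without learned projections.

    queries: (Q, d), keys_values: (K, d). The score matrix has shape (Q, K).
    """
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ keys_values


def latent_bottleneck_decode(context, latents, mask_queries):
    """Assumed two-step flow: context -> latents -> masked positions."""
    # Encode: M latents read from N_ctx context tokens (scores are M x N_ctx).
    latents = cross_attention(latents, context)
    # Decode: N_mask queries read from M latents (scores are N_mask x M).
    return cross_attention(mask_queries, latents)


rng = np.random.default_rng(0)
d, num_latents = 32, 8                       # hypothetical sizes
context = rng.normal(size=(100, d))          # observed (unmasked) tokens
latents = rng.normal(size=(num_latents, d))  # fixed number of latent tokens
mask_queries = rng.normal(size=(300, d))     # one query per masked position
decoded = latent_bottleneck_decode(context, latents, mask_queries)
print(decoded.shape)  # (300, 32)
```

Neither score matrix has a dimension that pairs video tokens with video tokens, so with the latent count held fixed, both steps grow linearly with the number of context and masked tokens. All masked positions are also decoded in one pass rather than one token at a time.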
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Image Enhancement Techniques · Advanced Image Processing Techniques
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dense Connections · Position-Wise Feed-Forward Layer · Adam · Softmax · Label Smoothing · Byte Pair Encoding · Residual Connection
