Factorized Video Autoencoders for Efficient Generative Modelling
Mohammed Suhail, Carlos Esteves, Leonid Sigal, Ameesh Makadia

TL;DR
This paper introduces a factorized autoencoder whose four-plane latent space makes high-dimensional video modeling more efficient, enabling faster, more memory-efficient generation while maintaining high-quality reconstructions.
Contribution
The paper presents a four-plane factorized autoencoder whose latent space grows sublinearly with input size, improving efficiency for video generative modeling.
Findings
Retains high-fidelity reconstructions despite heavy compression
Enables faster and more memory-efficient video generation with latent diffusion models
Supports conditional generation tasks such as class-conditional generation, frame prediction, and interpolation
Abstract
Latent variable generative models have emerged as powerful tools for generative tasks including image and video synthesis. These models are enabled by pretrained autoencoders that map high resolution data into a compressed lower dimensional latent space, where the generative models can subsequently be developed while requiring fewer computational resources. Despite their effectiveness, the direct application of latent variable models to higher dimensional domains such as videos continues to pose challenges for efficient training and inference. In this paper, we propose an autoencoder that projects volumetric data onto a four-plane factorized latent space that grows sublinearly with the input size, making it ideal for higher dimensional data like videos. The design of our factorized model supports straightforward adoption in a number of conditional generation tasks with latent diffusion…
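The sublinear growth claimed in the abstract can be illustrated with a small counting sketch. The plane layout here is an assumption made for illustration (two spatial planes of size H x W, plus one T x W and one T x H spatio-temporal plane); the abstract does not specify the layout, channel counts, or downsampling factors, so the function names and sizes below are hypothetical.

```python
def volume_size(t: int, h: int, w: int) -> int:
    """Number of positions in a full volumetric latent of shape T x H x W."""
    return t * h * w


def four_plane_size(t: int, h: int, w: int) -> int:
    """Number of positions in an ASSUMED four-plane layout.

    Assumption (not taken from the source): two H x W spatial planes,
    one T x W plane, and one T x H plane. Channels are ignored.
    """
    return 2 * h * w + t * w + t * h


if __name__ == "__main__":
    # Hypothetical latent sizes; each step doubles every dimension.
    for scale in (1, 2, 4):
        t, h, w = 8 * scale, 16 * scale, 16 * scale
        vol = volume_size(t, h, w)
        planes = four_plane_size(t, h, w)
        print(f"T={t:3d} H={h:3d} W={w:3d}  volume={vol:7d}  "
              f"planes={planes:6d}  ratio={planes / vol:.3f}")
```

Under this assumed layout, doubling every dimension multiplies the volumetric count by 8 but the plane count by only 4, so the plane representation grows roughly as the two-thirds power of the volume size.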
Taxonomy
Topics: Advanced Data Compression Techniques · Image and Signal Denoising Methods · Generative Adversarial Networks and Image Synthesis
Methods: Diffusion · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
