Factorized Video Autoencoders for Efficient Generative Modelling

Mohammed Suhail; Carlos Esteves; Leonid Sigal; Ameesh Makadia

arXiv:2412.04452 · cs.CV · June 13, 2025

TL;DR

This paper introduces a factorized autoencoder that encodes video into a four-plane latent space, making latent generative models faster and more memory-efficient to train and sample from while maintaining high-quality reconstructions.

Contribution

The paper presents a novel four-plane factorized autoencoder whose latent space grows sublinearly with input size, improving efficiency for video generative modeling.

Findings

01 Retains high-fidelity reconstructions despite heavy compression

02 Enables faster and memory-efficient video generation with latent diffusion models

03 Supports various conditional generation tasks like class-conditional generation, frame prediction, and interpolation

Abstract

Latent variable generative models have emerged as powerful tools for generative tasks including image and video synthesis. These models are enabled by pretrained autoencoders that map high-resolution data into a compressed lower-dimensional latent space, where the generative models can subsequently be developed while requiring fewer computational resources. Despite their effectiveness, the direct application of latent variable models to higher-dimensional domains such as videos continues to pose challenges for efficient training and inference. In this paper, we propose an autoencoder that projects volumetric data onto a four-plane factorized latent space that grows sublinearly with the input size, making it ideal for higher-dimensional data like videos. The design of our factorized model supports straightforward adoption in a number of conditional generation tasks with latent diffusion…
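
The sublinear-growth claim can be illustrated with a little arithmetic. The sketch below is not the authors' implementation. It assumes a hypothetical latent grid of size t × h × w and a hypothetical plane layout (two spatial h × w planes, one t × h plane, and one t × w plane), neither of which is specified in the text above. It only counts latent positions per channel, comparing a full volumetric latent with four planes.

```python
def volumetric_latent_size(t: int, h: int, w: int) -> int:
    """Number of positions in a full t x h x w latent volume."""
    return t * h * w


def four_plane_latent_size(t: int, h: int, w: int) -> int:
    """Number of positions in four 2D planes.

    Assumed layout (hypothetical, for illustration only):
    two spatial planes (h x w), one t x h plane, one t x w plane.
    """
    return 2 * h * w + t * h + t * w


if __name__ == "__main__":
    # Grow all three dimensions together and compare the two counts.
    for s in (8, 16, 32):
        vol = volumetric_latent_size(s, s, s)
        planes = four_plane_latent_size(s, s, s)
        print(f"side={s:3d}  volume={vol:6d}  planes={planes:5d}  ratio={planes / vol:.3f}")
```

Under these assumptions, doubling every side multiplies the volumetric count by 8 but the plane count by only 4, so the planes occupy a shrinking fraction of the volume as the input grows.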

Taxonomy

Topics: Advanced Data Compression Techniques · Image and Signal Denoising Methods · Generative Adversarial Networks and Image Synthesis

Methods: Diffusion · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings