LTX-Video: Realtime Video Latent Diffusion

Yoav HaCohen; Nisan Chiprut; Benny Brazowski; Daniel Shalem; Dudu Moshe; Eitan Richardson; Eran Levin; Guy Shiran; Nir Zabari; Ori Gordon; Poriya Panet; Sapir Weissbuch; Victor Kulikov; Yaki Bitterman; Zeev Melumian; Ofir Bibi

arXiv:2501.00103 · cs.CV · January 3, 2025 · 2 cites

TL;DR

LTX-Video is a transformer-based latent diffusion model that generates high-resolution, temporally consistent video faster than real time by tightly integrating a high-compression Video-VAE with a denoising transformer.

Contribution

The paper introduces a holistic design that optimizes the interaction between the Video-VAE and the denoising transformer, rather than treating them as independent components, enabling faster and higher-quality video generation.

Findings

01. Generates 5 seconds of 24 fps video at 768x512 resolution in 2 seconds on an Nvidia H100 GPU, which is faster than real time.

02. Supports both text-to-video and image-to-video generation.

03. Outperforms existing models of similar scale in speed and quality.
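
As a quick sanity check on the speed finding, the sketch below turns the reported figures (a 5-second clip at 24 fps, generated in 2 seconds) into a real-time factor and an effective generation rate. The function names are hypothetical and used only for illustration; the inputs are the numbers quoted in the findings.

```python
def realtime_factor(clip_seconds: float, generation_seconds: float) -> float:
    """Seconds of video produced per second of wall-clock generation time."""
    return clip_seconds / generation_seconds


def generated_fps(clip_seconds: float, playback_fps: float,
                  generation_seconds: float) -> float:
    """Frames produced per second of wall-clock generation time."""
    return clip_seconds * playback_fps / generation_seconds


if __name__ == "__main__":
    # Figures quoted in the findings: 5 s of 24 fps video in 2 s.
    print(realtime_factor(5, 2))    # 2.5x faster than playback
    print(generated_fps(5, 24, 2))  # 60 frames per second of compute
```

A factor above 1.0 means the clip is produced faster than it plays back, which is the sense in which the title uses "realtime".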

Abstract

We introduce LTX-Video, a transformer-based latent diffusion model that adopts a holistic approach to video generation by seamlessly integrating the responsibilities of the Video-VAE and the denoising transformer. Unlike existing methods, which treat these components as independent, LTX-Video aims to optimize their interaction for improved efficiency and quality. At its core is a carefully designed Video-VAE that achieves a high compression ratio of 1:192, with spatiotemporal downscaling of 32 x 32 x 8 pixels per token, enabled by relocating the patchifying operation from the transformer's input to the VAE's input. Operating in this highly compressed latent space enables the transformer to efficiently perform full spatiotemporal self-attention, which is essential for generating high-resolution videos with temporal consistency. However, the high compression inherently limits the…
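
The compression figures in the abstract can be checked with a little arithmetic. The sketch below assumes 3-channel RGB input and treats a 5-second, 24 fps clip as exactly 120 frames; both are assumptions, not statements from the abstract. Under them, a 1:192 ratio with 32 x 32 x 8 pixels per token implies 128 latent values per token, and a 768x512 clip maps to a few thousand tokens, small enough for full spatiotemporal self-attention to be practical. All function names are hypothetical.

```python
# Figures taken from the abstract.
PATCH_W, PATCH_H, PATCH_T = 32, 32, 8  # pixels (and frames) covered per token
COMPRESSION_RATIO = 192                # stated as 1:192

# Assumption: the input video is standard 3-channel RGB.
RGB_CHANNELS = 3


def pixel_values_per_token() -> int:
    """Raw pixel values covered by a single latent token."""
    return PATCH_W * PATCH_H * PATCH_T * RGB_CHANNELS


def implied_latent_channels() -> int:
    """Latent values per token implied by the stated compression ratio."""
    return pixel_values_per_token() // COMPRESSION_RATIO


def token_count(width: int, height: int, frames: int) -> int:
    """Number of latent tokens for a clip, ignoring any edge or first-frame handling."""
    return (width // PATCH_W) * (height // PATCH_H) * (frames // PATCH_T)


def attention_pairs(tokens: int) -> int:
    """Token pairs scored by full self-attention (quadratic in token count)."""
    return tokens * tokens


if __name__ == "__main__":
    tokens = token_count(768, 512, 5 * 24)
    print(pixel_values_per_token())   # 24576
    print(implied_latent_channels())  # 128
    print(tokens)                     # 5760
    print(attention_pairs(tokens))    # 33177600
```

Because attention cost grows with the square of the token count, halving the tokens along any axis cuts the pairwise work by a factor of four, which is why the aggressive 32 x 32 x 8 downscaling matters for speed.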

Code & Models

Repositories

Lightricks/LTX-Video (PyTorch, official)

Taxonomy

Topics: Video Analysis and Summarization · Generative Adversarial Networks and Image Synthesis · Advanced Data Compression Techniques

Methods: Latent Diffusion Model · Diffusion