Video Generation with Predictive Latents

Yian Zhao; Feng Wang; Qiushan Guo; Chang Liu; Xiangyang Ji; Jian Zhang; Jie Chen

arXiv:2605.02134 · cs.CV · May 5, 2026

TL;DR

This paper introduces PV-VAE, a predictive video VAE that improves video generation by encoding temporal dynamics, converging faster and generating higher-quality video than the baseline.

Contribution

The paper proposes a novel predictive reconstruction objective for video VAEs, improving latent space temporal coherence and generative performance.

Findings

01. 52% faster convergence compared to baseline

02. 34.42 FVD improvement on UCF101

03. Enhanced downstream video understanding performance

Abstract

Video Variational Autoencoders (VAEs) enable latent video generative modeling by mapping the visual world into compact spatiotemporal latent spaces, improving training efficiency and stability. While existing video VAEs achieve commendable reconstruction quality, continued optimization of reconstruction does not necessarily translate into improved generative performance. How to enhance the diffusability of video latents remains a critical and unresolved challenge. In this work, inspired by principles of predictive world modeling, we investigate the potential of predictive learning to improve video generative modeling. To this end, we introduce a simple and effective predictive reconstruction objective that unifies predictive learning with video reconstruction. Specifically, we randomly discard future frames and encode only partial past observations, while training the decoder to…
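
The frame-discarding step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract is truncated before it states the decoder's target, so the sketch assumes the decoder is asked to output the full clip (observed past frames plus the discarded future frames) and scores it with a plain mean-squared error. The names `sample_past`, `predictive_reconstruction_loss`, `toy_encode`, and `toy_decode` are hypothetical, and the toy encoder and decoder are placeholders for real networks.

```python
import numpy as np


def sample_past(video, rng, min_keep=1):
    """Randomly discard trailing (future) frames and return the kept past frames.

    video: array of shape (T, H, W, C). The kept length is drawn uniformly
    from [min_keep, T]; this sampling rule is an assumption of this sketch.
    """
    num_frames = video.shape[0]
    keep = int(rng.integers(min_keep, num_frames + 1))  # upper bound is exclusive
    return video[:keep], keep


def predictive_reconstruction_loss(video, encode, decode, rng):
    """Encode only the past frames, decode a full-length clip, and score it.

    ASSUMPTION: the target is the whole clip, so the loss covers both the
    observed past frames and the discarded future frames.
    """
    past, keep = sample_past(video, rng)
    latent = encode(past)                     # encoder sees past frames only
    output = decode(latent, video.shape[0])   # decoder emits all T frames
    loss = float(np.mean((output - video) ** 2))
    return loss, keep


def toy_encode(past):
    """Placeholder encoder: average the observed frames over time."""
    return past.mean(axis=0)


def toy_decode(latent, num_frames):
    """Placeholder decoder: repeat the latent frame num_frames times."""
    return np.repeat(latent[None], num_frames, axis=0)


rng = np.random.default_rng(0)
video = rng.random((8, 4, 4, 3))  # 8 frames of 4x4 RGB noise
loss, keep = predictive_reconstruction_loss(video, toy_encode, toy_decode, rng)
print(f"kept {keep} of {video.shape[0]} frames, loss = {loss:.4f}")
```

In a real setup the two placeholder functions would be the VAE encoder and decoder, and the loss would typically include the usual VAE regularization terms, which this sketch omits.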
