Seer: Language Instructed Video Prediction with Latent Diffusion Models

Xianfan Gu; Chuan Wen; Weirui Ye; Jiaming Song; Yang Gao

arXiv:2303.14897 · cs.CV · April 28, 2026 · 6 cites


TL;DR

Seer is a novel, efficient video prediction model that leverages pretrained text-to-image diffusion models and a new instruction decomposition technique to generate high-quality, instruction-aligned videos with less data and computation.

Contribution

The paper introduces Seer, a new framework that adapts pretrained diffusion models for text-conditioned video prediction, incorporating a novel instruction decomposition module and efficient attention mechanisms.

Findings

01 Seer achieves a 31% FVD improvement over the prior state of the art on SSv2.

02 Seer needs about 480 GPU hours, versus 12,480 for CogVideo (a 26x reduction).

03 Seer attains an 83.7% average preference in human evaluations.

Abstract

Imagining the future trajectory is the key for robots to make sound planning and successfully reach their goals. Therefore, text-conditioned video prediction (TVP) is an essential task to facilitate general robot policy learning. To tackle this task and empower robots with the ability to foresee the future, we propose a sample and computation-efficient model, named Seer, by inflating the pretrained text-to-image (T2I) stable diffusion models along the temporal axis. We enhance the U-Net and language conditioning model by incorporating computation-efficient spatial-temporal attention. Furthermore, we introduce a novel Frame Sequential Text Decomposer module that dissects a sentence's global instruction into temporally aligned sub-instructions, ensuring precise integration into each frame of generation. Our framework allows us to effectively leverage the extensive prior knowledge…
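
As a rough illustration of why spatial-temporal attention can be computation-efficient, the sketch below counts query-key pairs for joint attention over all video tokens versus a factorized scheme (spatial attention within each frame, then temporal attention at each spatial location). The abstract does not specify the exact attention layout, so the factorization, the function name `attention_pairs`, and the sizes (12 frames, a 32x32 latent grid) are assumptions chosen only for illustration.

```python
def attention_pairs(frames, height, width):
    """Count query-key pairs for joint vs. factorized attention.

    Hypothetical helper: it illustrates the general cost argument for
    factorized spatial-temporal attention, not Seer's actual layers.
    """
    spatial_tokens = height * width

    # Joint attention: every token attends to every token in every frame.
    joint = (frames * spatial_tokens) ** 2

    # Factorized attention (assumed layout):
    #   spatial pass  - tokens attend only within their own frame
    #   temporal pass - each spatial location attends across frames
    spatial = frames * spatial_tokens ** 2
    temporal = spatial_tokens * frames ** 2

    return joint, spatial + temporal


# Illustrative sizes only (assumed, not taken from the paper).
joint, factorized = attention_pairs(frames=12, height=32, width=32)
print(f"joint: {joint:,} pairs")
print(f"factorized: {factorized:,} pairs")
print(f"ratio: {joint / factorized:.1f}x")
```

Under these assumed sizes the factorized scheme needs roughly an order of magnitude fewer pairs, and the gap widens as the number of frames grows, since the joint cost scales with the square of the frame count times the square of the spatial token count.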


Code & Models

Repositories

seervideodiffusion/SeerVideoLDM (GitHub)

Videos

Seer: Language Instructed Video Prediction with Latent Diffusion Models (SlidesLive)