TL;DR
Seer is a novel, efficient video prediction model that leverages pretrained text-to-image diffusion models and a new instruction decomposition technique to generate high-quality, instruction-aligned videos with less data and computation.
Contribution
The paper introduces Seer, a new framework that adapts pretrained diffusion models for text-conditioned video prediction, incorporating a novel instruction decomposition module and efficient attention mechanisms.
Findings
Seer improves FVD (Fréchet Video Distance) by 31% over the prior state of the art on Something-Something V2 (SSv2).
Seer requires 480 GPU hours, compared with 12,480 for CogVideo.
Seer attains 83.7% average preference in human evaluations.
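To put the compute finding in proportion, the short calculation below derives the reduction factor from the two GPU-hour figures quoted above. The variable names are illustrative only; the numbers are the ones in the list.

```python
# GPU-hour figures as quoted in the Findings list.
cogvideo_gpu_hours = 12_480
seer_gpu_hours = 480

# How many times fewer GPU hours Seer uses.
reduction_factor = cogvideo_gpu_hours / seer_gpu_hours

# Fraction of CogVideo's GPU hours that is saved.
relative_saving = 1 - seer_gpu_hours / cogvideo_gpu_hours

print(f"{reduction_factor:.0f}x fewer GPU hours")
print(f"{relative_saving:.1%} of the GPU hours saved")
```

In other words, the reported budget is roughly one twenty-sixth of CogVideo's.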
Abstract
Imagining the future trajectory is the key for robots to make sound planning and successfully reach their goals. Therefore, text-conditioned video prediction (TVP) is an essential task to facilitate general robot policy learning. To tackle this task and empower robots with the ability to foresee the future, we propose a sample- and computation-efficient model, named Seer, by inflating the pretrained text-to-image (T2I) stable diffusion models along the temporal axis. We enhance the U-Net and language conditioning model by incorporating computation-efficient spatial-temporal attention. Furthermore, we introduce a novel Frame Sequential Text Decomposer module that dissects a sentence's global instruction into temporally aligned sub-instructions, ensuring precise integration into each frame of generation. Our framework allows us to effectively leverage the extensive prior knowledge…
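The abstract does not spell out how the spatial-temporal attention saves computation. As a rough, generic illustration (an assumption, not a description of Seer's actual layers), the sketch below counts query-key pairs for full joint attention over all video tokens versus a factorized scheme that attends within each frame and then across frames at each spatial location. The function names and the example sizes are hypothetical.

```python
def joint_attention_pairs(frames: int, tokens_per_frame: int) -> int:
    """Query-key pairs when every token attends to every token in the clip."""
    total_tokens = frames * tokens_per_frame
    return total_tokens * total_tokens


def factorized_attention_pairs(frames: int, tokens_per_frame: int) -> int:
    """Query-key pairs for spatial attention per frame plus temporal attention per location.

    This factorization is an assumed, commonly used design; the abstract
    does not confirm that Seer uses exactly this form.
    """
    spatial = frames * tokens_per_frame ** 2   # each frame attends over its own tokens
    temporal = tokens_per_frame * frames ** 2  # each location attends over the frames
    return spatial + temporal


# Hypothetical sizes chosen only for illustration.
frames, tokens_per_frame = 12, 1024

joint = joint_attention_pairs(frames, tokens_per_frame)
factorized = factorized_attention_pairs(frames, tokens_per_frame)

print(f"joint: {joint:,} pairs")
print(f"factorized: {factorized:,} pairs")
print(f"ratio: {joint / factorized:.1f}x")
```

Under these assumed sizes the factorized count is an order of magnitude smaller, which is the general reason such schemes are described as computation-efficient.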
